Hpc compass transtec_2012

HIGH PERFORMANCE COMPUTING
TECHNOLOGY COMPASS 2012/13

Application areas: Automotive, Aerospace, Engineering, Simulation, CAE, CAD, Risk Analysis, Price Modelling, High Throughput Computing, Big Data Analytics, Life Sciences
TECHNOLOGY COMPASS
TABLE OF CONTENTS AND INTRODUCTION

HIGH PERFORMANCE COMPUTING .......................... 4
  Performance Turns Into Productivity .............. 6
  Flexible deployment with xCAT .................... 8

CLUSTER MANAGEMENT MADE EASY ........................ 12
  Bright Cluster Manager ........................... 14

INTELLIGENT HPC WORKLOAD MANAGEMENT ................. 28
  Moab HPC Suite – Enterprise Edition .............. 30
  New in Moab 7.0 .................................. 34
  Moab HPC Suite – Basic Edition ................... 37
  Moab HPC Suite – Grid Option ..................... 43

NICE ENGINEFRAME .................................... 50
  A technical portal for remote visualization ...... 52
  Application highlights ........................... 54
  Desktop Cloud Virtualization ..................... 57
  Remote Visualization ............................. 58

INTEL CLUSTER READY ................................. 62
  A Quality Standard for HPC Clusters .............. 64
  Intel Cluster Ready builds HPC Momentum .......... 69
  The transtec Benchmarking Center ................. 73

WINDOWS HPC SERVER 2008 R2 .......................... 74
  Elements of the Microsoft HPC Solution ........... 76
  Deployment, system management, and monitoring .... 78
  Job scheduling ................................... 80
  Service-oriented architecture .................... 82
  Networking and MPI ............................... 85
  Microsoft Office Excel support ................... 88

PARALLEL NFS ........................................ 90
  The New Standard for HPC Storage ................. 92
  What's new in NFS 4.1? ........................... 94
  Panasas HPC Storage .............................. 99

NVIDIA GPU COMPUTING ............................... 110
  The CUDA Architecture ........................... 112
  Codename "Fermi" ................................ 116
  Introducing NVIDIA Parallel Nsight .............. 122
  QLogic TrueScale InfiniBand and GPUs ............ 126

INFINIBAND ......................................... 130
  High-speed interconnects ........................ 132
  Top 10 Reasons to Use QLogic TrueScale InfiniBand 136
  Intel MPI Library 4.0 Performance ............... 139
  InfiniBand Fabric Suite (IFS) – What's New in Version 6.0 141

PARSTREAM .......................................... 144
  Big Data Analytics .............................. 146

GLOSSARY ........................................... 156
MORE THAN 30 YEARS OF EXPERIENCE IN SCIENTIFIC COMPUTING

1980 marked the beginning of a decade in which numerous startups were created, some of which later transformed into big players in the IT market. Technical innovations brought dramatic changes to the nascent computer market. In Tübingen, close to one of Germany's prime and oldest universities, transtec was founded.

In the early days, transtec focused on reselling DEC computers and peripherals, delivering high-performance workstations to university institutes and research facilities. In 1987, SUN/Sparc and storage solutions broadened the portfolio, enhanced by IBM/RS6000 products in 1991. These were the typical workstation and server systems for high performance computing at the time, used by the majority of researchers worldwide.

In the late 90s, transtec was one of the first companies to offer highly customized HPC cluster solutions based on standard Intel architecture servers, some of which entered the TOP500 list of the world's fastest computing systems.

Given this background and history, it is fair to say that transtec looks back on more than 30 years of experience in scientific computing; our track record shows nearly 500 HPC installations. With this experience, we know exactly what customers' demands are and how to meet them. High performance and ease of management – this is what customers require today. HPC systems are certainly required to peak-perform, as their name indicates, but that is not enough: they must also be easy to handle. Unwieldy design and operational complexity must be avoided, or at least hidden from administrators and particularly from users of HPC computer systems.

transtec HPC solutions deliver ease of management, both in the Linux and Windows worlds, and even where the customer's environment is highly heterogeneous. Even the dynamic provisioning of HPC resources as needed does not constitute any problem, further leading to maximal utilization of the cluster.

transtec HPC solutions use the latest and most innovative technology. Their superior performance goes hand in hand with energy efficiency, as you would expect from any leading-edge IT solution. We regard these as basic characteristics.

This brochure focuses on where transtec HPC solutions excel. To name a few: Bright Cluster Manager as the technology leader for unified HPC cluster management; the leading-edge Moab HPC Suite for job and workload management; Intel Cluster Ready certification as an independent quality standard for our systems; and Panasas HPC storage systems for the highest performance and best scalability required of an HPC storage system. Again, with these components, usability and ease of management are central issues that are addressed. Also, being an NVIDIA Tesla Preferred Provider, transtec is able to provide customers with well-designed, extremely powerful solutions for Tesla GPU computing. QLogic's InfiniBand Fabric Suite makes managing a large InfiniBand fabric easier than ever before. transtec masterfully combines excellent and well-chosen components into a fine-tuned, customer-specific, and thoroughly designed HPC solution.

Last but not least, your decision for a transtec HPC solution means you opt for the most intensive customer care and the best service in HPC. Our experts will be glad to bring in their expertise and support to assist you at any stage, from HPC design to daily cluster operations to HPC Cloud Services.

Have fun reading the transtec HPC Compass 2012/13!
High Performance Computing (HPC) has been with us from the very beginning of the computer era. High-performance computers were built to solve numerous problems which the "human computers" could not handle. The term HPC just hadn't been coined yet. More importantly, some of the early principles have changed fundamentally.

HPC systems in the early days were much different from those we see today. First, we saw enormous mainframes from large computer manufacturers, including a proprietary operating system and job management system. Second, at universities and research institutes, workstations made inroads and scientists carried out calculations on their dedicated Unix or VMS workstations. In either case, if you needed more computing power, you scaled up, i.e. you bought a bigger machine.

Today the term High-Performance Computing has gained a fundamentally new meaning. HPC is now perceived as a way to tackle complex mathematical, scientific or engineering problems. The integration of industry-standard, "off-the-shelf" server hardware into HPC clusters facilitates the construction of computer networks of such power that one single system could never achieve. The new paradigm for parallelization is scaling out.
HIGH PERFORMANCE COMPUTING
PERFORMANCE TURNS INTO PRODUCTIVITY

Computer-supported simulation of realistic processes (so-called Computer Aided Engineering – CAE) has established itself as a third key pillar in the field of science and research, alongside theory and experimentation. It is nowadays inconceivable that an aircraft manufacturer or a Formula One racing team would operate without using simulation software. And scientific calculations, such as in the fields of astrophysics, medicine, pharmaceuticals and bio-informatics, will to a large extent be dependent on supercomputers in the future. Software manufacturers long ago recognized the benefit of high-performance computers based on powerful standard servers and ported their programs to them accordingly.

The main advantage of scale-out supercomputers is just that: they are infinitely scalable, at least in principle. Since they are based on standard hardware components, such a supercomputer can be charged with more power whenever the computational capacity of the system is no longer sufficient, simply by adding additional nodes of the same kind. A cumbersome switch to a different technology can be avoided in most cases.

The primary rationale in using HPC clusters is to grow, to scale out computing capacity as far as necessary. To reach that goal, an HPC cluster returns most of the investment when it is continuously fed with computing problems.

The secondary reason for building scale-out supercomputers is to maximize the utilization of the system.

"transtec HPC solutions are meant to provide customers with unparalleled ease-of-management and ease-of-use. Apart from that, deciding for a transtec HPC solution means deciding for the most intensive customer care and the best service imaginable."
Dr. Oliver Tennert, Director Technology Management & HPC Solutions
VARIATIONS ON THE THEME: MPP AND SMP

Parallel computations exist in two major variants today. Applications running in parallel on multiple compute nodes are frequently so-called Massively Parallel Processing (MPP) applications. MPP indicates that the individual processes can each utilize exclusive memory areas. This means that such jobs are predestined to be computed in parallel, distributed across the nodes in a cluster. The individual processes can thus utilize the separate units of the respective node – especially the RAM, the CPU power and the disk I/O.

Communication between the individual processes is implemented in a standardized way through the MPI software interface (Message Passing Interface), which abstracts the underlying network connections between the nodes from the processes. However, the MPI standard (current version 2.0) merely requires source code compatibility, not binary compatibility, so an off-the-shelf application usually needs specific versions of MPI libraries in order to run. Examples of MPI implementations are OpenMPI, MPICH2, MVAPICH2, Intel MPI or – for Windows clusters – MS-MPI.

If the individual processes engage in a large amount of communication, the response time of the network (latency) becomes important. Latency in a Gigabit Ethernet or a 10GE network is typically around 10 µs. High-speed interconnects such as InfiniBand reduce latency by a factor of 10, down to as low as 1 µs. Therefore, high-speed interconnects can greatly speed up total processing.

The other frequently used variant is called SMP applications. SMP, in this HPC context, stands for Shared Memory Processing. It involves the use of shared memory areas, the specific implementation of which is dependent on the choice of the underlying operating system. Consequently, SMP jobs generally only run on a single node, where they can in turn be multi-threaded and thus be parallelized across the number of CPUs per node. For many HPC applications, both the MPP and the SMP variant can be chosen.

Many applications are not inherently suitable for parallel execution. In such a case, there is no communication between the individual compute nodes, and therefore no need for a high-speed network between them; nevertheless, multiple computing jobs can be run simultaneously and sequentially on each individual node, depending on the number of CPUs. In order to ensure optimum computing performance for these applications, it must be examined how many CPUs and cores deliver the optimum performance. We find applications of this sequential type of work typically in the fields of data analysis or Monte-Carlo simulations.
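A sequential, embarrassingly parallel workload of the Monte-Carlo type mentioned above can be illustrated with a short sketch (not from the brochure; the function names and sample counts are invented for illustration). Each worker draws its samples independently, with no communication between processes – which is exactly why such jobs need no high-speed interconnect:

```python
# Illustrative sketch: an embarrassingly parallel Monte-Carlo estimate
# of pi. Each worker samples independently; no inter-process
# communication is needed, mirroring the sequential job type above.
import random
from multiprocessing import Pool

def sample(args):
    """Count random points in the unit square that fall inside the quarter circle."""
    seed, n = args
    rng = random.Random(seed)
    return sum(1 for _ in range(n)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

def estimate_pi(total_samples=400_000, workers=4):
    n = total_samples // workers
    with Pool(workers) as pool:
        hits = pool.map(sample, [(seed, n) for seed in range(workers)])
    return 4.0 * sum(hits) / (n * workers)

if __name__ == "__main__":
    print(estimate_pi())  # close to 3.14159
```

On a cluster, each such worker would simply run as a separate job on a separate node; a genuine MPP code would instead exchange data between nodes via an MPI implementation such as those named above.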
FLEXIBLE DEPLOYMENT WITH XCAT

xCAT as a Powerful and Flexible Deployment Tool
xCAT (Extreme Cluster Administration Tool) is an open source toolkit for the deployment and low-level administration of HPC cluster environments, small as well as large ones. xCAT provides simple commands for hardware control, node discovery, the collection of MAC addresses, and node deployment with local disks (diskful) or without (diskless). The cluster configuration is stored in a relational database. Node groups for different operating system images can be defined. Also, user-specific scripts can be executed automatically at installation time.

xCAT Provides the Following Low-Level Administrative Features
• Remote console support
• Parallel remote shell and remote copy commands
• Plugins for various monitoring tools like Ganglia or Nagios
• Hardware control commands for node discovery, collecting MAC addresses, remote power switching and resetting of nodes
• Automatic configuration of syslog, remote shell, DNS, DHCP, and ntp within the cluster
• Extensive documentation and man pages

For cluster monitoring, we install and configure the open source tool Ganglia or the even more powerful open source solution Nagios, according to the customer's preferences and requirements.

Local Installation or Diskless Installation
We offer a diskful or a diskless installation of the cluster nodes. A diskless installation means the operating system is hosted partially within the main memory; larger parts may or may not be included via NFS or other means. This approach allows for deploying large numbers of nodes very efficiently, and the cluster is up and running within a very small timescale. Also, updating the cluster can be done in a very efficient way: only the boot image has to be updated, and the nodes have to be rebooted. After this, the nodes run either a new kernel or even a new operating system. Moreover, with this approach, partitioning the cluster can also be done very efficiently, either for testing purposes, or for allocating different cluster partitions to different users or applications.

Development Tools, Middleware, and Applications
Depending on the application, optimization strategy, or underlying architecture, different compilers lead to code of very different performance. Moreover, different, mainly commercial, applications require different MPI implementations. And even when the code is self-developed, developers often prefer one MPI implementation over another.

According to the customer's wishes, we install various compilers, MPI middleware, as well as job management systems like Parastation, Grid Engine, Torque/Maui, or the very powerful Moab HPC Suite for high-level cluster management.
SERVICES AND CUSTOMER CARE FROM A TO Z

[Diagram: the transtec HPC service cycle – presales consulting; application-, customer-, and site-specific sizing of the HPC solution; benchmarking of different systems; burn-in tests of systems; software & OS installation; onsite hardware assembly; integration into the customer's environment; individual customer training; maintenance, support & managed services; continual improvement.]
HPC @ TRANSTEC: SERVICES AND CUSTOMER CARE FROM A TO Z

transtec AG has over 30 years of experience in scientific computing and is one of the earliest manufacturers of HPC clusters. For nearly a decade, transtec has delivered highly customized High Performance clusters based on standard components to academic and industry customers across Europe, with all the high quality standards and the customer-centric approach that transtec is well known for.

Every transtec HPC solution is more than just a rack full of hardware – it is a comprehensive solution with everything the HPC user, owner, and operator need.

In the early stages of any customer's HPC project, transtec experts provide extensive and detailed consulting to the customer – they benefit from expertise and experience. Consulting is followed by benchmarking of different systems with either specifically crafted customer code or generally accepted benchmarking routines; this aids customers in sizing and devising the optimal and detailed HPC configuration.

Each and every piece of HPC hardware that leaves our factory undergoes a burn-in procedure of 24 hours, or more if necessary. We make sure that any hardware shipped meets our and our customers' quality requirements. transtec HPC solutions are turnkey solutions. By default, a transtec HPC cluster has everything installed and configured – from hardware and operating system to important middleware components like cluster management or developer tools, and the customer's production applications. Onsite delivery means onsite integration into the customer's production environment, be it establishing network connectivity to the corporate network, or setting up software and configuration parts.

transtec HPC clusters are ready-to-run systems – we deliver, you turn the key, the system delivers high performance. Every HPC project entails transfer to production: IT operation processes and policies apply to the new HPC system. Effectively, IT personnel are trained hands-on, introduced to hardware components and software, with all operational aspects of configuration management.

transtec services do not stop when the implementation project ends. Beyond transfer to production, transtec takes care. transtec offers a variety of support and service options, tailored to the customer's needs. When you are in need of a new installation, a major reconfiguration or an update of your solution, transtec is able to support your staff and, if you lack the resources for maintaining the cluster yourself, maintain the HPC solution for you. From Professional Services to Managed Services for daily operations and required service levels, transtec will be your complete HPC service and solution provider. transtec's high standards of performance, reliability and dependability assure your productivity and complete satisfaction.

transtec's HPC Managed Services offer customers the possibility of having the complete management and administration of the HPC cluster handled by transtec service specialists, in an ITIL-compliant way. Moreover, transtec's HPC on Demand services provide access to HPC resources whenever customers need them, for example because they do not have the possibility of owning and running an HPC cluster themselves, due to lacking infrastructure, know-how, or admin staff.
Bright Cluster Manager removes the complexity from the installation, management and use of HPC clusters, without compromising performance or capability. With Bright Cluster Manager, an administrator can easily install, use and manage multiple clusters simultaneously, without the need for expert knowledge of Linux or HPC.
CLUSTER MANAGEMENT MADE EASY
BRIGHT CLUSTER MANAGER

THE CLUSTER INSTALLER TAKES THE ADMINISTRATOR THROUGH THE INSTALLATION PROCESS AND OFFERS ADVANCED OPTIONS SUCH AS "EXPRESS" AND "REMOTE".

BY SELECTING A CLUSTER NODE IN THE TREE ON THE LEFT AND THE TASKS TAB ON THE RIGHT, THE ADMINISTRATOR CAN EXECUTE A NUMBER OF POWERFUL TASKS ON THAT NODE WITH JUST A SINGLE MOUSE CLICK.

A UNIFIED APPROACH
Other cluster management offerings take a "toolkit" approach in which a Linux distribution is combined with many third-party tools for provisioning, monitoring, alerting, etc. This approach has critical limitations because those separate tools were not designed to work together, were not designed for HPC, and were not designed to scale. Furthermore, each of the tools has its own interface (mostly command-line based), and each has its own daemons and databases. Countless hours of scripting and testing by highly skilled people are required to get the tools to work for a specific cluster, and much of it goes undocumented.

Bright Cluster Manager takes a much more fundamental, integrated and unified approach. It was designed and written from the ground up for straightforward, efficient, comprehensive cluster management. It has a single lightweight daemon, a central database for all monitoring and configuration data, and a single CLI and GUI for all cluster management functionality. This approach makes Bright Cluster Manager extremely easy to use, scalable, secure and reliable, complete, flexible, and easy to maintain and support.

EASE OF INSTALLATION
Bright Cluster Manager is easy to install. Typically, system administrators can install and test a fully functional cluster from "bare metal" in less than an hour. Configuration choices made during the installation can be modified afterwards. Multiple installation modes are available, including unattended and remote modes. Cluster nodes can be automatically identified based on switch ports rather than MAC addresses, improving the speed and reliability of installation, as well as subsequent maintenance.
EASE OF USE
Bright Cluster Manager is easy to use. System administrators have two options: the intuitive Cluster Management Graphical User Interface (CMGUI) and the powerful Cluster Management Shell (CMSH). The CMGUI is a standalone desktop application that provides a single system view for managing all hardware and software aspects of the cluster through a single point of control. Administrative functions are streamlined as all tasks are performed through one intuitive, visual interface. Multiple clusters can be managed simultaneously. The CMGUI runs on Linux, Windows and MacOS (coming soon) and can be extended using plugins. The CMSH provides practically the same functionality as the CMGUI, but via a command-line interface. The CMSH can be used both interactively and in batch mode via scripts. Either way, system administrators now have unprecedented flexibility and control over their clusters.

CLUSTER METRICS, SUCH AS GPU AND CPU TEMPERATURES, FAN SPEEDS AND NETWORK STATISTICS, CAN BE VISUALIZED BY SIMPLY DRAGGING AND DROPPING THEM FROM THE LIST ON THE LEFT INTO A GRAPHING WINDOW ON THE RIGHT. MULTIPLE METRICS CAN BE COMBINED IN ONE GRAPH AND GRAPHS CAN BE ZOOMED INTO. GRAPH LAYOUT AND COLORS CAN BE TAILORED TO YOUR REQUIREMENTS.
THE STATUS OF CLUSTER NODES, SWITCHES AND OTHER HARDWARE, AS WELL AS UP TO SIX METRICS, CAN BE VISUALIZED IN THE RACKVIEW. A ZOOM-OUT OPTION IS AVAILABLE FOR CLUSTERS WITH MANY RACKS.

THE OVERVIEW TAB PROVIDES INSTANT, HIGH-LEVEL INSIGHT INTO THE STATUS OF THE CLUSTER.

SUPPORT FOR LINUX AND WINDOWS
Bright Cluster Manager is based on Linux and is available with a choice of pre-integrated, pre-configured and optimized Linux distributions, including SUSE Linux Enterprise Server, Red Hat Enterprise Linux, CentOS and Scientific Linux. Dual-boot installations with Windows HPC Server are supported as well, allowing nodes to boot either from the Bright-managed Linux head node or from the Windows-managed head node.

EXTENSIVE DEVELOPMENT ENVIRONMENT
Bright Cluster Manager provides an extensive HPC development environment for both serial and parallel applications, including the following (some optional):
• Compilers, including full suites from GNU, Intel, AMD and Portland Group
• Debuggers and profilers, including the GNU debugger and profiler, TAU, TotalView, Allinea DDT and Allinea OPT
• GPU libraries, including CUDA and OpenCL
• MPI libraries, including OpenMPI, MPICH, MPICH2, MPICH-MX, MPICH2-MX, MVAPICH and MVAPICH2; all cross-compiled with the compilers installed on Bright Cluster Manager, and optimized for high-speed interconnects such as InfiniBand and Myrinet
• Mathematical libraries, including ACML, FFTW, GMP, GotoBLAS, MKL and ScaLAPACK
• Other libraries, including Global Arrays, HDF5, IPP, TBB, NetCDF and PETSc

THE PARALLEL SHELL ALLOWS FOR SIMULTANEOUS EXECUTION OF COMMANDS OR SCRIPTS ACROSS NODE GROUPS OR ACROSS THE ENTIRE CLUSTER.

Bright Cluster Manager also provides Environment Modules to make it easy to maintain multiple versions of compilers, libraries and applications for different users on the cluster, without creating compatibility conflicts. Each Environment Module file contains the information needed to configure the shell for an application, and automatically sets these variables correctly for the particular application when it is loaded. Bright Cluster Manager includes many preconfigured module files for many scenarios, such as combinations of compilers, mathematical libraries and MPI libraries.

POWERFUL IMAGE MANAGEMENT AND PROVISIONING
Bright Cluster Manager features sophisticated software image management and provisioning capability. A virtually unlimited number of images can be created and assigned to as many different categories of nodes as required. Default or custom Linux kernels can be assigned to individual images. Incremental changes to images can be deployed to live nodes without rebooting or re-installation. The provisioning system propagates only changes to the images, minimizing time and impact on system performance and availability. Provisioning capability can be assigned to any number of nodes on-the-fly, for maximum flexibility and scalability. Bright Cluster Manager can also provision over InfiniBand and to RAM disk.

COMPREHENSIVE MONITORING
With Bright Cluster Manager, system administrators can collect, monitor, visualize and analyze a comprehensive set of metrics. Practically all software and hardware metrics available to the Linux kernel, and all hardware management interface metrics (IPMI, iLO, etc.), are sampled.
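The Environment Modules mechanism described above can be made concrete with a small sketch. Conceptually, "loading" a module does little more than prepend an application's directories to environment variables so that one particular version is found first. This is purely illustrative – the install prefixes and module names below are hypothetical, and real module files are processed by the Environment Modules package, not by Python:

```python
# Illustrative sketch of what loading an environment module effectively
# does: prepend one version's directories to the shell environment.
# The paths and module names are hypothetical.
MODULES = {
    "openmpi/1.4": {"PATH": "/cm/shared/apps/openmpi/1.4/bin",
                    "LD_LIBRARY_PATH": "/cm/shared/apps/openmpi/1.4/lib"},
    "openmpi/1.6": {"PATH": "/cm/shared/apps/openmpi/1.6/bin",
                    "LD_LIBRARY_PATH": "/cm/shared/apps/openmpi/1.6/lib"},
}

def module_load(name, env):
    """Prepend the named module's directories to the environment dict."""
    for var, path in MODULES[name].items():
        env[var] = path + (":" + env[var] if env.get(var) else "")
    return env

env = module_load("openmpi/1.6", {"PATH": "/usr/bin"})
print(env["PATH"])  # /cm/shared/apps/openmpi/1.6/bin:/usr/bin
```

Because each version lives under its own prefix and only the environment changes, several versions can coexist on the cluster without conflicts – which is exactly the compatibility problem Environment Modules solves.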
HIGH PERFORMANCE MEETS EFFICIENCY

Initially, massively parallel systems constitute a challenge to both administrators and users. They are complex beasts. Anyone building HPC clusters will need to tame the beast, master the complexity and present users and administrators with an easy-to-use, easy-to-manage system landscape. Leading HPC solution providers such as transtec achieve this goal. They hide the complexity of HPC under the hood and match high performance with efficiency and ease-of-use for both users and administrators. The "P" in "HPC" gains a double meaning: "Performance" plus "Productivity".

Cluster and workload management software like Moab HPC Suite, Bright Cluster Manager or QLogic IFS provides the means to master and hide the inherent complexity of HPC systems. For administrators and users, HPC clusters are presented as single, large machines, with many different tuning parameters. The software also provides a unified view of existing clusters whenever unified management is added as a requirement by the customer at any point in time after the first installation. Thus, daily routine tasks such as job management, user management, and queue partitioning and management can be performed easily with either graphical or web-based tools, without any advanced scripting skills or technical expertise required from the administrator or user.
THE BRIGHT ADVANTAGE
Bright Cluster Manager offers many advantages that lead to improved productivity, uptime, scalability, performance and security, while reducing total cost of ownership.

Rapid Productivity Gains
• Easy to learn and use, with an intuitive GUI
• Quick installation: from bare metal to a cluster ready to use, in less than an hour
• Fast, flexible provisioning: incremental, live, diskful, diskless, provisioning over InfiniBand, auto node discovery
• Comprehensive monitoring: on-the-fly graphs, rackview, multiple clusters, custom metrics
• Powerful automation: thresholds, alerts, actions
• Complete GPU support: NVIDIA, AMD ATI, CUDA, OpenCL
• On-demand SMP: instant ScaleMP virtual SMP deployment
• Powerful cluster management shell and SOAP API for automating tasks and creating custom capabilities
• Seamless integration with leading workload managers: PBS Pro, Moab, Maui, SLURM, Grid Engine, Torque, LSF
• Integrated (parallel) application development environment
• Easy maintenance: automatically update your cluster from Linux and Bright Computing repositories
• Web-based user portal

Maximum Uptime
• Unattended, robust head node failover to spare head node
• Powerful cluster automation functionality allows preemptive actions based on monitoring thresholds
• Comprehensive cluster monitoring and health checking framework, including automatic sidelining of unhealthy nodes to prevent job failure

Scalability from Deskside to TOP500
• Off-loadable provisioning for maximum scalability
• Proven on some of the world's largest clusters

Minimum Overhead/Maximum Performance
• Single lightweight daemon drives all functionality
• Daemon heavily optimized to minimize effect on operating system and applications
• Single database stores all metric and configuration data

Top Security
• Automated security and other updates from key-signed repositories
• Encrypted external and internal communications (optional)
• X.509v3 certificate-based public-key authentication
• Role-based access control and complete audit trail
• Firewalls and secure LDAP
CLUSTER MANAGEMENT MADE EASY
BRIGHT CLUSTER MANAGER

Examples include CPU and GPU temperatures, fan speeds, switches, hard disk SMART information, system load, memory utilization, network statistics, storage metrics, power systems statistics, and workload management statistics. Custom metrics can also easily be defined. Metric sampling is done very efficiently – in one process, or out-of-band where possible. System administrators have full flexibility over how and when metrics are sampled, and historic data can be consolidated over time to save disk space.

EXAMPLE GRAPHS THAT VISUALIZE METRICS ON A GPU CLUSTER.

CLUSTER MANAGEMENT AUTOMATION
Cluster management automation takes preemptive actions when predetermined system thresholds are exceeded, saving time and preventing hardware damage. System thresholds can be configured on any of the available metrics. The built-in configuration wizard guides the system administrator through the steps of defining a rule: selecting metrics, defining thresholds and specifying actions. For example, a temperature threshold for GPUs can be established that results in the system automatically shutting down an overheated GPU unit and sending an SMS message to the system administrator's mobile phone. Several predefined actions are available, but any Linux command or script can be configured as an action.

THE AUTOMATION CONFIGURATION WIZARD GUIDES THE SYSTEM ADMINISTRATOR THROUGH THE STEPS OF DEFINING A RULE: SELECTING METRICS, DEFINING THRESHOLDS AND SPECIFYING ACTIONS.

COMPREHENSIVE GPU MANAGEMENT
Bright Cluster Manager radically reduces the time and effort of managing GPUs, and fully integrates these devices into the single view of the overall system. Bright includes powerful GPU management and monitoring capability that leverages functionality in NVIDIA Tesla GPUs. System administrators can easily assume maximum control of the GPUs and gain instant and time-based status insight. In addition to the standard cluster management capabilities, Bright Cluster Manager monitors the full range of GPU metrics, including:
• GPU temperature, fan speed, utilization
• GPU exclusivity, compute, display, persistence mode
• GPU memory utilization, ECC statistics
• Unit fan speed, serial number, temperature, power usage, voltages and currents, LED status, firmware
• Board serial, driver version, PCI info

Beyond metrics, Bright Cluster Manager features built-in support for GPU computing with CUDA and OpenCL libraries.

MULTI-TASKING VIA PARALLEL SHELL
The parallel shell allows simultaneous execution of multiple commands and scripts across the cluster as a whole, or across easily definable groups of nodes. Output from the executed commands is displayed in a convenient way with variable levels of verbosity. Running commands and scripts can be killed easily if necessary. The parallel shell is available through both the CMGUI and the CMSH.
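The rule structure produced by the automation wizard – a metric, a threshold, and one or more actions – can be sketched in a few lines of Python. This is an illustrative model only; the metric names and actions are hypothetical and do not reflect Bright Cluster Manager's actual API:

```python
# Illustrative sketch of a monitoring-threshold rule engine: a rule pairs a
# metric with a threshold and fires its action when the threshold is exceeded.
# Metric names and actions are hypothetical, not Bright Cluster Manager's API.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Rule:
    metric: str                          # e.g. "gpu0_temperature"
    threshold: float                     # fire when the sampled value exceeds this
    action: Callable[[str, float], str]  # receives metric name and sampled value

def evaluate(rules: List[Rule], samples: Dict[str, float]) -> List[str]:
    """Check each rule against the latest samples; return results of fired actions."""
    fired = []
    for rule in rules:
        value = samples.get(rule.metric)
        if value is not None and value > rule.threshold:
            fired.append(rule.action(rule.metric, value))
    return fired

# Example action: shut down an overheated GPU unit and notify the administrator.
def shutdown_and_notify(metric: str, value: float) -> str:
    return f"shutdown unit for {metric}={value}; SMS sent to admin"

rules = [Rule("gpu0_temperature", threshold=90.0, action=shutdown_and_notify)]
print(evaluate(rules, {"gpu0_temperature": 95.5}))
```

In the real product, the action would be any predefined action, Linux command or script; here it is simply a function returning a description of what it did.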
Switching between current and previous versions of CUDA and OpenCL has also been made easy.

INTEGRATED WORKLOAD MANAGEMENT
Bright Cluster Manager is integrated with a wide selection of free and commercial workload managers. This integration provides a number of benefits:
• The selected workload manager gets automatically installed and configured
• Many workload manager metrics are monitored
• The GUI provides a user-friendly interface for configuring, monitoring and managing the selected workload manager
• The CMSH and the SOAP API provide direct and powerful access to a number of workload manager commands and metrics
• Reliable workload manager failover is properly configured
• The workload manager is continuously made aware of the health state of nodes (see section on Health Checking)

The following user-selectable workload managers are tightly integrated with Bright Cluster Manager:
• PBS Pro, Moab, Maui, LSF
• SLURM, Grid Engine, Torque

Alternatively, Lava, LoadLeveler or other workload managers can be installed on top of Bright Cluster Manager.

WORKLOAD MANAGEMENT QUEUES CAN BE VIEWED AND CONFIGURED FROM THE GUI, WITHOUT THE NEED FOR WORKLOAD MANAGEMENT EXPERTISE.

INTEGRATED SMP SUPPORT
Bright Cluster Manager – Advanced Edition dynamically aggregates multiple cluster nodes into a single virtual SMP node, using ScaleMP's Versatile SMP™ (vSMP) architecture. Creating and dismantling a virtual SMP node can be achieved with just a few clicks within the CMGUI. Virtual SMP nodes can also be launched and dismantled automatically using the scripting capabilities of the CMSH. In Bright Cluster Manager a virtual SMP node behaves like any other node, enabling transparent, on-the-fly provisioning, configuration, monitoring and management of virtual SMP nodes as part of the overall system management.

CREATING AND DISMANTLING A VIRTUAL SMP NODE CAN BE ACHIEVED WITH JUST A FEW CLICKS WITHIN THE GUI OR A SINGLE COMMAND IN THE CLUSTER MANAGEMENT SHELL.

MAXIMUM UPTIME WITH HEAD NODE FAILOVER
Bright Cluster Manager – Advanced Edition allows two head nodes to be configured in active-active failover mode. Both head nodes are on active duty, but if one fails, the other takes over all tasks, seamlessly.

MAXIMUM UPTIME WITH HEALTH CHECKING
Bright Cluster Manager – Advanced Edition includes a powerful cluster health checking framework that maximizes system uptime. It continually checks multiple health indicators for all hardware and software components and proactively initiates corrective actions. It can also automatically perform a series of standard and user-defined tests just before starting a new job, to ensure a successful execution. Examples of corrective actions include autonomous bypass of faulty nodes, automatic job requeuing to avoid queue flushing, and process "jailing" to allocate, track, trace and flush completed user processes. The health checking framework ensures the highest job throughput, the best overall cluster efficiency and the lowest administration overhead.

WEB-BASED USER PORTAL
The web-based user portal provides read-only access to essential cluster information, including a general overview of the cluster status, node hardware and software properties, workload manager statistics and user-customizable graphs. The User Portal can easily be customized and expanded using PHP and the SOAP API.

USER AND GROUP MANAGEMENT
Users can be added to the cluster through the CMGUI or the CMSH. Bright Cluster Manager comes with a pre-configured LDAP database, but an external LDAP service, or alternative authentication system, can be used instead.

ROLE-BASED ACCESS CONTROL AND AUDITING
Bright Cluster Manager's role-based access control mechanism allows administrator privileges to be defined on a per-role basis.
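The pre-job health checking idea – run all checks on each candidate node and sideline any node that fails before workload is placed on it – can be sketched as follows. This is a minimal illustration, not Bright's implementation; the node names, check name and data are invented:

```python
# Illustrative sketch (not Bright's implementation): run a set of health checks
# on each candidate node just before job start, and sideline any node that
# fails so the job is placed only on healthy nodes.
from typing import Callable, Dict, List, Tuple

HealthCheck = Callable[[str], bool]  # takes a node name, returns pass/fail

def select_healthy(nodes: List[str],
                   checks: Dict[str, HealthCheck]) -> Tuple[List[str], List[str]]:
    """Return (healthy, sidelined) node lists after running all checks."""
    healthy, sidelined = [], []
    for node in nodes:
        if all(check(node) for check in checks.values()):
            healthy.append(node)
        else:
            sidelined.append(node)  # candidate for autonomous bypass
    return healthy, sidelined

# Hypothetical check, named after the "AllFansRunning" check shown in the
# rackview screenshot; the fan states here are made-up sample data.
fan_state = {"node001": True, "node002": False, "node003": True}
checks = {"AllFansRunning": lambda n: fan_state[n]}
healthy, sidelined = select_healthy(["node001", "node002", "node003"], checks)
print(healthy)  # node002 fails the fan check and is sidelined
```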
Administrator actions can be audited using an audit file which stores all their write actions.

TOP CLUSTER SECURITY
Bright Cluster Manager offers an unprecedented level of security that can easily be tailored to local requirements. Security features include:
• Automated security and other updates from key-signed Linux and Bright Computing repositories
• Encrypted internal and external communications
• X.509v3 certificate-based public-key authentication to the cluster management infrastructure
• Role-based access control and complete audit trail
• Firewalls and secure LDAP
• Secure shell access

THE WEB-BASED USER PORTAL PROVIDES READ-ONLY ACCESS TO ESSENTIAL CLUSTER INFORMATION, INCLUDING A GENERAL OVERVIEW OF THE CLUSTER STATUS, NODE HARDWARE AND SOFTWARE PROPERTIES, WORKLOAD MANAGER STATISTICS AND USER-CUSTOMIZABLE GRAPHS.

"The building blocks for transtec HPC solutions must be chosen according to our goals ease-of-management and ease-of-use. With Bright Cluster Manager, we are happy to have the technology leader at hand, meeting these requirements, and our customers value that."
Armin Jäger
HPC Solution Engineer

MULTI-CLUSTER CAPABILITY
Bright Cluster Manager is ideal for organizations that need to manage multiple clusters, either in one or in multiple locations. Capabilities include:
• All cluster management and monitoring functionality available for all clusters through one GUI
• Selecting any set of configurations in one cluster and exporting them to any or all other clusters with a few mouse clicks
• Making node images available to other clusters

BRIGHT CLUSTER MANAGER CAN MANAGE MULTIPLE CLUSTERS SIMULTANEOUSLY. THIS OVERVIEW SHOWS CLUSTERS IN OSLO, ABU DHABI AND HOUSTON, ALL MANAGED THROUGH ONE GUI.

CLUSTER HEALTH CHECKS CAN BE VISUALIZED IN THE RACKVIEW. THIS SCREENSHOT SHOWS THAT GPU UNIT 41 FAILS A HEALTH CHECK CALLED "ALLFANSRUNNING".

STANDARD AND ADVANCED EDITIONS
Bright Cluster Manager is available in two editions: Standard and Advanced. The table on this page lists the differences. You can easily upgrade from the Standard to the Advanced Edition as your cluster grows in size or complexity.

DOCUMENTATION AND SERVICES
A comprehensive system administrator manual and user manual are included in PDF format. Customized training and professional services are available. Services include various levels of support, installation services and consultancy.
FEATURE                              STANDARD    ADVANCED
Choice of Linux distributions        x           x
Intel Cluster Ready                  x           x
Cluster Management GUI               x           x
Cluster Management Shell             x           x
Web-Based User Portal                x           x
SOAP API                             x           x
Node Provisioning                    x           x
Node Identification                  x           x
Cluster Monitoring                   x           x
Cluster Automation                   x           x
User Management                      x           x
Parallel Shell                       x           x
Workload Manager Integration         x           x
Cluster Security                     x           x
Compilers                            x           x
Debuggers & Profilers                x           x
MPI Libraries                        x           x
Mathematical Libraries               x           x
Environment Modules                  x           x
NVIDIA CUDA & OpenCL                 x           x
GPU Management & Monitoring          x           x
ScaleMP Management & Monitoring      -           x
Redundant Failover Head Nodes        -           x
Cluster Health Checking              -           x
Off-loadable Provisioning            -           x
Suggested Number of Nodes            4–128       129–10,000+
Multi-Cluster Management             -           x
Standard Support                     x           x
Premium Support                      Optional    Optional
While all HPC systems face challenges in workload demand, resource complexity, and scale, enterprise HPC systems face more stringent challenges and expectations. Enterprise HPC systems must meet mission-critical and priority HPC workload demands for commercial businesses and business-oriented research and academic organizations. They have complex SLAs and priorities to balance. Their HPC workloads directly impact revenue, product delivery, and the broader objectives of their organizations.
INTELLIGENT HPC WORKLOAD MANAGEMENT
MOAB HPC SUITE – ENTERPRISE EDITION

MOAB HPC SUITE
Moab is the most powerful intelligence engine for policy-based, predictive scheduling across workloads and resources. Moab HPC Suite accelerates results delivery and maximizes utilization while simplifying workload management across complex, heterogeneous cluster environments. The Moab HPC Suite products leverage the multi-dimensional policies in Moab to continually model and monitor workloads, resources, SLAs, and priorities to optimize workload output. These policies utilize the unique Moab management abstraction layer that integrates data across heterogeneous resources and resource managers to maximize control as you automate workload management actions.

Managing the World's Top Systems, Ready to Manage Yours
Moab manages the world's largest, most scale-intensive and complex HPC environments, including 40% of the top 10 supercomputing systems, nearly 40% of the top 25, and 36% of the compute cores in the top 100 systems, based on rankings from www.Top500.org. So you know it is battle-tested and ready to efficiently and intelligently manage the complexities of your environment.

"With Moab HPC Suite, we can meet very demanding customers' requirements as regards unified management of heterogeneous cluster environments, grid management, and provide them with flexible and powerful configuration and reporting options. Our customers value that highly."
Thomas Gebert
HPC Solution Architect

MOAB HPC SUITE – ENTERPRISE EDITION
Moab HPC Suite – Enterprise Edition provides enterprise-ready HPC workload management that self-optimizes the productivity, workload uptime and meeting of SLAs and business priorities for HPC systems and HPC cloud. It uses the battle-tested and patented Moab intelligence engine to automate the mission-critical workload priorities of enterprise HPC systems. Enterprise customers benefit from a single integrated product that brings together key enterprise HPC capabilities, implementation, training, and 24x7 support services to speed the realization of benefits from their HPC system for their business. Moab HPC Suite – Enterprise Edition delivers:
• Productivity acceleration
• Uptime automation
• Auto-SLA enforcement
• Grid- and cloud-ready HPC management

Designed to Solve Enterprise HPC Challenges
While all HPC systems face challenges in workload and resource complexity, scale and demand, enterprise HPC systems face more stringent challenges and expectations. Enterprise HPC systems must meet mission-critical and priority HPC workload demands for commercial businesses and business-oriented research and academic organizations. These organizations have complex SLAs and priorities to balance. And their HPC workloads directly impact the revenue, product delivery, and organizational objectives of their organizations.

Enterprise HPC organizations must eliminate job delays and failures. They are also seeking to improve resource utilization and workload management efficiency across multiple heterogeneous systems. To maximize user productivity, they are required to make it easier for users to access and use HPC resources, and even to expand to other clusters or HPC cloud to better handle workload demand and surges.

BENEFITS
Moab HPC Suite – Enterprise Edition offers key benefits to reduce costs, improve service performance, and accelerate the productivity of enterprise HPC systems. These benefits drive the achievement of business objectives and outcomes that depend on the results the enterprise HPC systems deliver. Moab HPC Suite – Enterprise Edition delivers:

Productivity acceleration to get more results faster and at a lower cost
Moab HPC Suite – Enterprise Edition gets more results delivered faster from HPC resources to lower costs while accelerating overall system, user and administrator productivity. Moab provides the unmatched scalability, 90-99 percent utilization, and fast and simple job submission that is required to maximize productivity in enterprise HPC organizations. The Moab intelligence engine optimizes workload scheduling and orchestrates resource provisioning and management to maximize workload speed and quantity. It also unifies workload management across heterogeneous resources, resource managers and even multiple clusters to reduce management complexity and costs.

Uptime automation to ensure workload completes successfully
HPC job and resource failures in enterprise HPC systems lead to delayed results and missed organizational opportunities and objectives. Moab HPC Suite – Enterprise Edition intelligently automates workload and resource uptime in HPC systems to ensure that workload completes reliably and avoids these failures.

Auto-SLA enforcement to consistently meet service guarantees and business priorities
Moab HPC Suite – Enterprise Edition uses the powerful Moab intelligence engine to optimally schedule and dynamically adjust workload to consistently meet service level agreements (SLAs), guarantees, and business priorities. This automatically ensures that the right workloads are completed at the optimal times, taking into account the complex mix of departments, priorities and SLAs to be balanced.

Grid- and cloud-ready HPC management to more efficiently manage and meet workload demand
The benefits of a traditional HPC environment can be extended to more efficiently manage and meet workload and resource demand by sharing workload across multiple clusters through the grid management and HPC cloud management capabilities provided in Moab HPC Suite – Enterprise Edition.

CAPABILITIES
Moab HPC Suite – Enterprise Edition brings together key enterprise HPC capabilities into a single integrated product that self-optimizes the productivity, workload uptime, and meeting of SLAs and priorities for HPC systems and HPC Cloud.

Productivity acceleration capabilities deliver more results faster, lower costs, and increase resource, user and administrator productivity:
• Massive scalability accelerates job response and throughput, including support for high throughput computing
• Workload-optimized allocation policies and provisioning get more results out of existing heterogeneous resources to reduce costs
• Workload unification across heterogeneous clusters maximizes resource availability for workloads and administration efficiency by managing workload as one cluster
• Simplified HPC submission and control for both users and administrators with job arrays, templates, self-service submission portal and administrator dashboard
• Optimized intelligent scheduling packs workloads and backfills around priority jobs and reservations while balancing SLAs to efficiently use all available resources
• Advanced scheduling and management of GPGPUs for jobs to maximize their utilization, including auto-detection, policy-based GPGPU scheduling and GPGPU metrics reporting
• Workload-aware auto-power management reduces energy use and costs by 30-40 percent with intelligent workload consolidation and auto-power management

Uptime automation capabilities ensure workload completes successfully and reliably, avoiding failures and missed organizational opportunities and objectives:
• Intelligent resource placement prevents job failures with granular resource modeling that ensures all workload requirements are met while avoiding at-risk resources
• Auto-response to incidents and events maximizes job and system uptime with configurable actions to pre-failure conditions, amber alerts, or other metrics and monitors
• Workload-aware maintenance scheduling helps maintain a stable HPC system without disrupting workload productivity
• Real-world services expertise ensures fast time to value and system uptime with an included package of implementation, training, and 24x7 remote support services

Auto-SLA enforcement schedules and adjusts workload to consistently meet service guarantees and business priorities so the right workloads are completed at the optimal times:
• Department budget enforcement schedules resources in line with resource sharing agreements and budgets (i.e. usage limits, usage reports, etc.)
• SLA and priority policies ensure the highest-priority workloads are processed first (i.e. quality of service, hierarchical priority weighting, dynamic fairshare policies, etc.)
• Continuous plus future scheduling ensures priorities and guarantees are proactively met as conditions and workload levels change (i.e. future reservations, priorities, and pre-emption)

Grid- and cloud-ready HPC management extends the benefits of your traditional HPC environment to more efficiently manage workload and better meet workload demand:
• Pay-for-use showback and chargeback capabilities track actual resource usage with flexible chargeback options and reporting by user or department
• Manage and share workload across multiple remote clusters to meet growing workload demand or surges with the single self-service portal and intelligence engine, with purchase of Moab HPC Suite – Grid Option

ARCHITECTURE
Moab HPC Suite – Enterprise Edition is architected to integrate on top of your existing job resource managers and other types of resource managers in your environment. It provides policy-based scheduling and management of workloads as well as resource allocation and provisioning orchestration. The Moab intelligence engine makes complex scheduling and management decisions based on all of the data it integrates from the various resource managers and then orchestrates the job and management actions through those resource managers. It does this without requiring any additional agents. This makes it the ideal choice to integrate with existing and new systems
NEW IN MOAB 7.0

NEW MOAB HPC SUITE 7.0
The new Moab HPC Suite 7.0 releases deliver continued breakthrough advancements in scalability, reliability, and job array management to accelerate system productivity, as well as extended database support. Here is a look at the new capabilities and the value they offer customers:

TORQUE Resource Manager Scalability and Reliability Advancements for Petaflop and Beyond
As part of the Moab HPC Suite 7.0 releases, the TORQUE 4.0 resource manager features scalability and reliability advancements to fully exploit Moab scalability. These advancements maximize your use of increasing hardware capabilities and enable you to meet growing HPC user needs. Key advancements in TORQUE 4.0 for Moab HPC Suite 7.0 include:
• The new Job Radix enables you to efficiently run jobs that span tens of thousands or even hundreds of thousands of nodes. Each MOM daemon now cascades job communication with multiple other MOM daemons simultaneously to reduce the job start-up process time to a small fraction of what it would normally take across a large number of nodes. The Job Radix eliminates lost jobs and job start-up bottlenecks caused by having all nodes' MOM daemons communicating with only one head MOM node. This saves critical minutes on job start-up time and allows for higher job throughput.
• New MOM daemon communication hierarchy increases the number of nodes supported and reduces the overhead of cluster status updates by distributing communication across multiple nodes instead of a single TORQUE head node. This makes status updates more efficient and improves scheduling speed and responsiveness.
• New multi-threading improves response and reliability, allowing for instant feedback to user requests as well as the ability to continue work even if some processes linger.
• Improved network communications, with all UDP-based communication replaced with TCP to make data transfers from node to node more reliable.

Job Array Auto-Cancellation Policies Improve System Productivity
Moab HPC Suite 7.0 improves system productivity with new job array auto-cancellation policies that cancel remaining sub-jobs in an array once the solution is found in the array results. This frees up resources, which would otherwise be running irrelevant jobs, to run other queued jobs more quickly. The job array auto-cancellation policies allow you to set auto-cancellation of sub-jobs based on the first or any instance of success or failure in the results, or on specific exit codes.

Extended Database Support Now Includes PostgreSQL and Oracle in Addition to MySQL
The extended database support in Moab HPC Suite 7.0 enables customers to use ODBC-compliant PostgreSQL and Oracle databases in addition to MySQL. This provides customers the flexibility to use the database that best meets their needs or is the standard for their system.

New Moab Web Services Provide Easier Standard Integration and Customization
New Moab Web Services provide easier standard integration and customization for a customer's environment, such as integration with existing user portals, plug-ins of resource managers for rich data integration, and script integration. Customers now have a standard interface to Moab with REST APIs.

Simplified Self-Service and Admin Dashboard Portal Experience
Moab HPC Suite 7.0 features an enhanced self-service and admin dashboard portal with simplified "click-based" job submission for end users as well as new visual cluster dashboard views of nodes, jobs, and reservations for more efficient management. The new Visual Cluster dashboard provides administrators and users views of their cluster resources that are easily filtered by almost any factor, including id, name, IP address, state, power, pending actions, reservations, load, memory, processors, etc. Users can also quickly filter and view their jobs by name, state, user, group, account, wall clock requested, memory requested, start date/time, submit date/time, etc. One-click drill-downs provide additional details and options for management actions.

Resource Usage Accounting Flexibility
Moab HPC Suite 7.0 includes more flexible resource usage accounting options that enable administrators to easily duplicate custom organizational hierarchies such as organization, groups, projects, business units, cost centers, etc. in the Moab Accounting Manager usage budgets and charging structure. This ensures resource usage is budgeted, tracked, and reported or charged back in the most useful way to admins and their customer groups and users.
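The job array auto-cancellation idea described above can be sketched as a small policy function. This is an illustrative model, not Moab's implementation or configuration syntax; the status values are invented for the example:

```python
# Illustrative sketch of a job-array auto-cancellation policy (not Moab's
# implementation): once any sub-job reports success, the remaining queued or
# running sub-jobs are selected for cancellation so their resources can be
# freed for other queued jobs.
from typing import Dict, List

def apply_first_success_policy(statuses: Dict[int, str]) -> List[int]:
    """statuses maps sub-job index -> 'queued' | 'running' | 'success' | 'failed'.
    Returns the sub-job indices to cancel under a cancel-on-first-success policy."""
    if not any(s == "success" for s in statuses.values()):
        return []  # no solution found yet; let the array keep running
    return sorted(i for i, s in statuses.items() if s in ("queued", "running"))

statuses = {0: "failed", 1: "success", 2: "running", 3: "queued", 4: "queued"}
print(apply_first_success_policy(statuses))  # sub-jobs 2, 3 and 4 are cancelled
```

A cancel-on-failure or exit-code variant would only change the condition tested against the sub-job statuses.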
as well as to manage your HPC system as it grows and expands in the future.

Moab HPC Suite – Enterprise Edition includes the patented Moab intelligence engine that enables it to integrate with and automate management across existing heterogeneous environments to optimize management and workload efficiency. This unique intelligence engine includes:
• Industry-leading multi-dimensional policies that automate the complex real-time decisions and actions for scheduling workload and allocating and adapting resources. These multi-dimensional policies can model and consider the workload requirements, resource attributes and affinities, SLAs and priorities to enable more complex and efficient decisions to be automated.
• Real-time and predictive future environment scheduling that drives more accurate and efficient decisions and service guarantees, as it can proactively adjust scheduling and resource allocations as it projects the impact of workload and resource condition changes.
• Open & flexible management abstraction layer lets you integrate the data and orchestrate workload actions across the chaos of complex heterogeneous cluster environments and management middleware to maximize workload control, automation, and optimization.

COMPONENTS
Moab HPC Suite – Enterprise Edition includes the following integrated products and technologies for a complete HPC workload management solution:
• Moab Workload Manager: Patented multi-dimensional intelligence engine that automates the complex decisions and orchestrates policy-based workload placement and scheduling as well as resource allocation, provisioning and energy management
• Moab Cluster Manager: Graphical desktop administrator application for managing, configuring, monitoring, and reporting for Moab-managed clusters
• Moab Viewpoint: Web-based user self-service job submission and management portal and administrator dashboard portal
• Moab Accounting Manager: HPC resource use budgeting and accounting tool that enforces resource sharing agreements and limits based on departmental budgets and provides showback and chargeback reporting for resource usage
• Moab Services Manager: Integration interfaces to resource managers and third-party tools

Moab HPC Suite – Enterprise Edition is also integrated with TORQUE, which is available as a free download on AdaptiveComputing.com. TORQUE is an open-source job/resource manager that provides continually updated information regarding the state of nodes and workload status. Adaptive Computing is the custodian of the TORQUE project and is actively developing the code base in cooperation with the TORQUE community to provide state-of-the-art resource management. Each Moab HPC Suite product subscription includes support for the Moab HPC Suite as well as TORQUE, if you choose to use TORQUE as the job/resource manager for your cluster.

MOAB HPC SUITE – BASIC EDITION
Moab HPC Suite – Basic Edition is a multi-dimensional, policy-based workload management system that accelerates and automates the scheduling, managing, monitoring, and reporting of HPC workloads on massive-scale, multi-technology installations. The Moab HPC Suite – Basic Edition patented multi-dimensional decision engine accelerates both the decisions and orchestration of workload across the ideal combination of diverse resources, including specialized resources like GPGPUs. The speed and accuracy of the decisions and scheduling automation optimize workload throughput and resource utilization, so more work is accomplished in less time with existing resources, controlling costs and increasing the value of HPC investments.

Moab HPC Suite – Basic Edition enables you to address pressing HPC challenges including:
• Delays to workload start and end times slowing results
• Inconsistent delivery on service guarantees and SLA commitments
• Under-utilization of resources
• How to efficiently manage workload across heterogeneous and hybrid systems of GPGPUs, hardware, and middleware
• How to simplify job submission and management for users and administrators to maximize productivity

Moab HPC Suite – Basic Edition acts as the "brain" of an HPC system to accelerate and automate complex decision-making processes. The patented decision engine is capable of making the complex multi-dimensional, policy-based decisions needed to schedule workload to optimize job speed, job success and resource utilization. Moab HPC Suite – Basic Edition integrates decision-making data from, and automates actions through, your system's existing mix of resource managers. This enables all the dimensions
of real-time granular resource attributes and state, as well as the timing of current and future resource commitments, to be factored into more efficient and accurate scheduling and allocation decisions. It also dramatically simplifies the management tasks and processes across these complex, heterogeneous environments. Moab works with many of the major resource management and industry-standard resource monitoring tools covering mixed hardware, network, storage and licenses.

Moab HPC Suite – Basic Edition policies are also able to factor in organizational priorities and complexities when scheduling workload and allocating resources. Moab ensures workload is processed according to organizational priorities and commitments and that resources are shared fairly across users, groups and even multiple organizations. This enables organizations to automatically enforce service guarantees and effectively manage organizational complexities with simple policy-based settings.

BENEFITS
Moab HPC Suite – Basic Edition drives more ROI and results from your HPC environment, including:
• Improved job response times and job throughput with a workload decision engine that accelerates complex workload scheduling decisions to enable faster job start times and high throughput computing
• Optimized resource utilization to 90-99 percent with multi-dimensional and predictive workload scheduling to accomplish more with your existing resources
• Automated enforcement of service guarantees, priorities, and resource sharing agreements across users, groups, and projects
• Increased productivity by simplifying HPC use, access, and control for both users and administrators with job arrays, job templates, optional user portal, and GUI administrator management and monitoring tool
• Streamlined job turnaround and reduced administrative burden by unifying and automating workload tasks and resource processes across diverse resources and mixed-system environments including GPGPUs
• A scalable workload management architecture that can manage peta-scale and beyond, is grid-ready, compatible with existing infrastructure, and extensible to manage your environment as it grows and evolves

CAPABILITIES
Moab HPC Suite – Basic Edition accelerates workload processing with a patented multi-dimensional decision engine that self-optimizes workload placement, resource utilization and results output while ensuring organizational priorities are met across the users and groups leveraging the HPC environment.

Policy-driven scheduling intelligently places workload on an optimal set of diverse resources to maximize job throughput and success as well as utilization and the meeting of workload and group priorities:
• Priority, SLA and resource sharing policies ensure the highest-priority workloads are processed first and resources are shared fairly across users and groups, such as quality of service, hierarchical priority weighting, and fairshare targets, limits and weights policies
• Allocation policies optimize resource utilization and prevent job failures with granular resource modeling and scheduling, affinity- and node topology-based placement
• Backfill job scheduling speeds job throughput and maximizes utilization by scheduling smaller or less demanding jobs where they fit around priority jobs and reservations, to use all available resources
• Security policies control which users and groups can access which resources
• Checkpointing

Real-time and predictive scheduling ensure job priorities and guarantees are proactively met as conditions and workload levels change:
• Advanced reservations guarantee that jobs run when required
• Maintenance reservations reserve resources for planned future maintenance to avoid disruption to business workloads
• Predictive scheduling enables the future workload schedule to be continually forecasted and adjusted, along with resource allocations, to adapt to changes in conditions and new job and reservation requests

Advanced scheduling and management of GPGPUs for jobs to maximize their utilization:
• Automatic detection and management of GPGPUs in the environment to eliminate manual configuration and make them immediately available for scheduling
• Exclusively allocate and schedule GPGPUs on a per-job basis
• Policy-based management & scheduling using GPGPU metrics
• Quick access to statistics on GPGPU utilization and key metrics for optimal management and issue diagnosis, such as error counts, temperature, fan speed, and memory
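The backfill policy mentioned above can be illustrated with a deliberately simplified model: while the top-priority job waits for its reservation, smaller queued jobs are started now if they fit in the free cores and will finish before the reservation begins. All job names and numbers below are invented; real backfill schedulers consider far more dimensions:

```python
# Simplified backfill sketch (illustrative, not Moab's scheduler): start smaller
# queued jobs in the idle gap before a reservation, but only if they fit in the
# free cores AND complete before the reservation starts, so the priority job
# is never delayed.
from typing import List, Tuple

def backfill(queue: List[Tuple[str, int, float]], free_cores: int,
             hours_until_reservation: float) -> List[str]:
    """queue entries are (job_name, cores_needed, runtime_hours), in priority order."""
    started = []
    for name, cores, runtime in queue:
        if cores <= free_cores and runtime <= hours_until_reservation:
            started.append(name)   # start now; it cannot delay the reservation
            free_cores -= cores
    return started

queue = [("big", 64, 12.0), ("small1", 8, 1.5), ("small2", 4, 3.0), ("small3", 16, 2.0)]
# 24 cores sit idle for the next 2 hours before a full-machine reservation:
print(backfill(queue, free_cores=24, hours_until_reservation=2.0))
# → ['small1', 'small3']  ("big" and "small2" cannot fit in the gap)
```

The effect is exactly the one the bullet describes: otherwise-idle cores are used without pushing back the priority job's guaranteed start.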
INTELLIGENT HPC WORKLOAD MANAGEMENT
MOAB HPC SUITE – BASIC EDITION

Easier submission, management, and control of job arrays improve user productivity and job throughput efficiency
 Users can easily submit thousands of sub-jobs with a single job submission, with an array index differentiating each array sub-job
 Job array usage limit policies enforce job maximums by credentials or class
 Simplified reporting and management of job arrays for end users filters jobs to summarize, track and manage at the master job level

Scalable job performance for large-scale, extreme-scale, and high-throughput computing environments
 Efficiently manages the submission and scheduling of hundreds of thousands of queued job submissions to support high throughput computing
 Fast scheduler response to user commands while scheduling, so users and administrators get the real-time job information they need
 Fast job throughput rate to get results started and delivered faster and keep utilization of resources up

Open and flexible management abstraction layer easily integrates with and automates management across existing heterogeneous resources and middleware to improve management efficiency
 Rich data integration and aggregation enables you to set powerful, multi-dimensional policies based on the existing real-time resource data monitored, without adding any new agents
 Heterogeneous resource allocation and management for workloads across mixed hardware, specialty resources such as GPGPUs, and the multiple resource managers used to manage the resources
 Supports integration with job resource managers such as TORQUE and SLURM, as well as with many other types of resource managers such as HP Cluster Management Utility, Nagios, Ganglia, FlexLM, and others

Ease of use and management improves productivity for both users and administrators
 Graphical administrator cluster management tool and portal provides unified workload management and reporting on resource utilization and status across the mixed resource environment to make management, issue diagnosis and performance optimization easier
 Optional customizable end-user portal provides visual job submission and management from any location, such as job forms, templates and start-time estimates, to reduce training and administrator requirements
 Job templates enable rapid submission of common jobs by pre-specifying the variety of resources needed for each job, reducing duplicate work and simplifying job submissions for users

ARCHITECTURE
Moab HPC Suite – Basic Edition is architected to integrate on top of your existing job resource managers and other types of resource managers in your environment to provide policy-based scheduling and management of workloads and resource allocation. It makes the complex decisions based on all of the data it integrates from the various resource managers and then orchestrates the job and management actions through those resource managers. This makes it the ideal choice to integrate with existing and new systems, as well as to manage your HPC system as it grows and expands in the future.

Moab HPC Suite – Basic Edition is designed with a patented intelligence engine architecture that enables it to integrate with and automate workload management across existing heterogeneous environments to improve management and workload efficiency. This unique architecture includes:
 Industry-leading multi-dimensional policies that automate the complex real-time decisions and actions for scheduling workload and adapting resources. These multi-dimensional policies can model and consider the workload requirements, resource attributes and affinities, SLAs and priorities to enable more complex and efficient decisions to be automated.
 Real-time and predictive future environment scheduling and analytics that drive more accurate and efficient decisions and service guarantees, as it can proactively adjust scheduling and resource allocations as it projects the impact of workload and resource condition changes.
 Open and flexible management abstraction layer that lets you integrate the data and orchestrate workload management actions across the chaos of complex heterogeneous IT environments and management middleware to maximize workload control, automation, and optimization.

COMPONENTS
Moab HPC Suite – Basic Edition includes the following integrated products and technologies for a complete cluster workload management solution:
 Moab Workload Manager: patented intelligence engine that automates the complex decisions and offers automation for policy-based workload placement, scheduling and resource allocation
 Moab Cluster Manager: graphical desktop administrator application for managing, configuring, monitoring, and reporting for Moab-managed clusters
 Moab Viewpoint: web-based user self-service job submission and management portal and administrator dashboard portal
 Moab Services: integration interfaces to resource managers and third-party tools

Moab HPC Suite – Basic Edition is also integrated with TORQUE, which is available as a free download on AdaptiveComputing.com. TORQUE is an open-source job/resource manager that provides continually updated information regarding the state of nodes and workload status. Adaptive Computing is the custodian of the TORQUE project and is actively developing the code base in cooperation with the TORQUE community to provide state-of-the-art resource management. Each Moab HPC Suite – Basic Edition product subscription includes support for the Moab HPC Suite – Basic Edition as well as for TORQUE, if you choose to use TORQUE as the job/resource manager for your cluster.

MOAB HPC SUITE – GRID OPTION
Moab HPC Suite – Grid Option is a powerful grid-workload management solution that provides unified scheduling, advanced and flexible policy management, integrated resource management, and consolidated monitoring and management across multiple clusters. Moab HPC Suite – Grid Option's patented intelligence engine accelerates, automates and unifies all of the complex workload decisions and automation actions needed to control and optimize the workload and resource components of advanced grids. It connects disparate clusters into a logical whole in a matter of minutes, with its decision engine enabling grid administrators and grid policies to have sovereignty over all systems while preserving control at the individual cluster.

Moab HPC Suite – Grid Option has a powerful range of capabilities that allow organizations to consolidate reporting, synchronize management across policies, processes and resources, and optimize workload sharing and data management across multiple clusters. Moab HPC Suite – Grid Option delivers these services in a near-transparent way, so users are unaware they are using grid resources; they know only that they are getting work done faster and more easily than ever before.

Moab HPC Suite – Grid Option addresses key challenges of managing and optimizing multiple clusters and moving to a more efficient and productive grid environment, including:
 Multi-cluster sprawl increases administrative burden, overhead costs and inefficiency, as workload and policy management, monitoring, reporting and planning must be done separately
 Unbalanced resource utilization slows job throughput and wastes resources: workload waits to run on overloaded clusters while underutilized resources sit idle on other clusters
 Workloads cannot be shared, due to different workload requirements, policies and SLAs across independent group- and organization-based clusters that make unified workload decisions complex and hard to enforce consistently across multiple clusters
 The multiple resource managers across multiple clusters need to be managed together to schedule and allocate resources more effectively; ripping and replacing them for grid management is ruled out by process and script investments
 Job submission needs to be integrated for users across multiple credential and submission tools for multiple clusters, to keep users productive with what they know while accelerating job processing and efficiency

UNIFIED AND FLEXIBLE GRID MANAGEMENT
Moab HPC Suite – Grid Option provides the automated complex decision control and management flexibility real-world grid environments require. Its multi-dimensional decision engine is able to accelerate workload output while balancing all the complexities of both grid and various local cluster and organizational workload priorities and policies. It can be flexibly configured to unify workload management policies into centralized management, provide centralized and local management policies, or integrate between local peer-to-peer cluster management policies. This ensures that service level guarantees and overall organizational objectives are achieved while utilization and valuable results output from the unified grid environment are dramatically increased.

MOAB HPC SUITE – GRID OPTION CAN BE FLEXIBLY CONFIGURED FOR CENTRALIZED, CENTRALIZED AND LOCAL, OR PEER-TO-PEER GRID POLICIES, DECISIONS AND RULES. IT IS ABLE TO MANAGE MULTIPLE RESOURCE MANAGERS AND MULTIPLE MOAB INSTANCES.

BENEFITS
Moab HPC Suite – Grid Option creates an optimized grid environment with key benefits that accelerate workload productivity and reduce management complexity, including:
 Fast time to value for grid implementation, with unified management across heterogeneous clusters that enables you to move quickly from cluster to optimized grid
 Improved job response times and job throughput, with a policy-driven and predictive decision engine that accelerates complex workload scheduling decisions to enable faster job start times and high throughput computing
 Optimized throughput and utilization across grid and clusters of 90-99%, with a flexible multi-dimensional decision engine that optimizes workload processing at both grid and cluster level
 Reduced management burden and costs, with grid-wide interface and reporting tools that provide a unified view of grid resources, status and usage charts, and trends over time for capacity planning, diagnostics, and accounting
 Scalable architecture to support local area to wide area grids, peta-scale, high throughput computing, and beyond
 Automated enforcement of grid and local cluster level
service guarantees, priorities, and resource sharing agreements across users, groups, and projects sharing grid resources
 Advanced administrative control allows various business units to access and view grid resources, regardless of physical or organizational boundaries, or alternatively restricts access to resources by specific departments or entities
 Increased productivity by simplifying use, access, and control of a broader set of HPC resources for both users and administrators, with integrated job submission, grid-aware job arrays, job templates, optional user portal, and GUI administrator management and monitoring tool

CAPABILITIES
Moab HPC Suite – Grid Option accelerates workload processing with a patented multi-dimensional decision engine that self-optimizes grid-wide and local workload placement, resource utilization and results output while ensuring complex organizational priorities are met across the users and groups leveraging the grid environment.

Policy-driven scheduling intelligently places workload on an optimal set of diverse resources to maximize job throughput and success as well as utilization and the meeting of workload and group priorities
 Priority, SLA and resource sharing policies ensure the highest priority workloads are processed first and resources are shared fairly across users and groups, such as quality of service, hierarchical priority weighting, and fairshare targets, limits and weights policies
 Grid-wide workload management policies that respect local cluster configuration and needs, and local policies and rules if desired, including granular settings to control where jobs can originate and be processed
 Allocation policies optimize resource utilization and prevent job failures with granular resource modeling and scheduling, affinity- and node topology-based placement
 Backfill job scheduling speeds job throughput and maximizes utilization by scheduling smaller or less demanding jobs as they can fit around priority jobs and reservations to use all available resources
 Security policies control which users and groups can access which resources, with simple credential mapping and integration with popular security tools used across multiple clusters
 Checkpointing and pre-emption

Real-time and predictive scheduling ensure job priorities and guarantees are proactively met as conditions and workload levels change across the grid
 Allow local cluster-level optimizations of most grid workload
 Optimized data staging ensures that remote data transfers are synchronized with resource availability to minimize poor utilization, leveraging existing data-migration technologies such as Secure Copy (SCP) and GridFTP
 Advanced reservations guarantee the availability of key resources at specific times and that jobs run when required
 Maintenance reservations reserve resources for planned future maintenance to avoid disruption to business workloads
 Predictive scheduling enables the future grid workload
transtec HPC solutions are designed for maximum flexibility and ease of management. We not only offer our customers the most powerful and flexible cluster management solution available, but also provide them with customized setup and site-specific configuration. Whether a customer needs a dynamic Linux-Windows dual-boot solution, unified management of different clusters at different sites, or fine-tuning of the Moab scheduler to implement fine-grained policy configuration, transtec not only gives you the framework at hand, but also helps you adapt the system to your special needs. Needless to say, when customers are in need of special trainings, transtec will be there to provide customers, administrators, or users with specially adjusted Educational Services.

Many years of experience in High Performance Computing have enabled us to develop efficient concepts for installing and deploying HPC clusters. For this, we leverage well-proven 3rd-party tools, which we assemble into a total solution and adapt and configure according to the customer's requirements. We manage everything from the complete installation and configuration of the operating system, through necessary middleware components like job and cluster management systems, up to the customer's applications.
schedule to be continually forecasted and adjusted along with resource allocations to adapt to changes in conditions and new job and reservation requests
 Enhance job performance with automatic learning that improves scheduling decisions based on historical workload results

Advanced scheduling and management of GPGPUs for jobs to maximize their utilization
 Automatic detection and management of GPGPUs in the grid environment to eliminate manual configuration and make them immediately available for scheduling
 Exclusively allocate and schedule GPGPUs on a per-job basis
 Policy-based management and scheduling using GPGPU metrics
 Quick access to statistics on GPGPU utilization and key metrics for optimal management and issue diagnosis, such as error counts, temperature, fan speed, and memory

Easier submission, management, and control of job arrays improve user productivity and job throughput efficiency
 Users can easily submit thousands of sub-jobs with a single job submission, with an array index differentiating each array sub-job
 Job array usage limit policies enforce job maximums by credentials or class
 Simplified reporting and management of job arrays for end users filters jobs to summarize, track and manage at the master job level
 Speed job processing with enhanced grid placement options for job arrays: optimal or single-cluster placement

Scalable job performance for large-scale, extreme-scale, and high-throughput computing environments
 Efficiently manages the submission and scheduling of hundreds of thousands of queued job submissions to support high throughput computing
 Fast scheduler response to user commands while scheduling, so users and administrators get the real-time job information they need
 Fast job throughput rate to get results started and delivered faster and keep utilization of resources up
 Scalable to manage up to 30 clusters when configured for centralized or centralized-and-local grid decisions and management

Open and flexible decision engine structure easily integrates with and automates management across existing heterogeneous resources and middleware to improve management efficiency
 Rich data integration and aggregation enables you to set powerful, multi-dimensional policies based on the existing real-time resource data monitored, without adding any new agents
 Unify management across existing internal, external, and partner clusters, even if they have different resource managers, databases, operating systems, and hardware
 Supports integration with job resource managers such as TORQUE and SLURM, as well as with many other types of resource managers such as HP Cluster Management Utility, Nagios, Ganglia, FlexLM, and others
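Conceptually, a job array of the kind described above is a single submission fanned out into indexed sub-jobs, with a usage-limit policy capping how many may run at once. The sketch below only illustrates that idea; the function name and policy shape are hypothetical and not Moab's or TORQUE's actual API.

```python
def expand_array(master_name, indices, max_active):
    """Fan a single submission out into indexed sub-jobs and mark
    which of them a usage-limit policy would allow to run at once."""
    sub_jobs = [f"{master_name}[{i}]" for i in indices]
    active = sub_jobs[:max_active]   # policy: at most max_active running
    queued = sub_jobs[max_active:]   # the rest wait at the master job level
    return active, queued

# one submission, a thousand sub-jobs, at most 64 running concurrently
active, queued = expand_array("render", range(1, 1001), max_active=64)
print(len(active), len(queued), active[0])  # → 64 936 render[1]
```

The user-facing benefit listed above follows directly from this shape: reporting and control happen against the one master job ("render"), while the scheduler tracks the thousand indexed sub-jobs underneath it.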
Ease of use and management improves productivity for both users and administrators
 Graphical administrator cluster management tool and portal provides a unified view of all grid operations to make self-diagnosis, planning, reporting, and accounting across all resources, jobs, and clusters easier
 Establish trust between resource owners through graphical usage controls, reports, and accounting across all shared resources
 Tune policies prior to rollout with cluster- and grid-level simulations
 Collaborate more effectively with multi-cluster co-allocation that allows key resource reservations for high-priority projects
 Optional customizable end-user portal provides integrated job submission and management from any location, such as job forms, templates and start-time estimates, to reduce training and administrator requirements
 Integrate job submission across multiple existing submission tools to reduce new end-user training requirements
 Job templates enable rapid submission of common or multiple jobs by pre-specifying the variety of resources needed for each job, reducing duplicate work and simplifying job submissions for users

ARCHITECTURE
Moab HPC Suite – Grid Option is architected to manage on top of the existing multiple job resource managers and other types of resource managers across the multiple clusters in your grid environment to provide unified policy-based scheduling and management of workloads and resource allocation. It makes the complex workload decisions based on all of the data it integrates from the various resource managers and then orchestrates the job and management actions through those resource managers based on policies. This makes it the ideal choice to integrate with existing and new systems and clusters, as well as to manage your grid as it grows and expands in the future.

Moab HPC Suite – Grid Option can be architected in three flexible grid management configurations: centralized; centralized and local; or peer-to-peer grid policies, decisions and rules. Its unique ability to manage multiple resource managers and multiple Moab instances makes this flexibility possible.

Moab HPC Suite – Grid Option is designed with a patented intelligence engine architecture that enables it to integrate with and automate workload management across existing heterogeneous environments and complex multiple organizational priorities to improve the management and workload efficiency of the environment. This unique architecture includes:
 Industry-leading multi-dimensional policies that automate the complex real-time decisions and actions for scheduling workload and adapting resources. These multi-dimensional policies can model and consider the workload requirements, resource attributes and affinities, SLAs and priorities to enable more complex and efficient decisions to be automated.
 Real-time and predictive future environment scheduling and analytics that drive more accurate and efficient decisions and service guarantees, as it can proactively adjust scheduling and resource allocations as it projects the impact of workload and resource condition changes.
 Open and flexible management abstraction layer that lets you integrate the data and orchestrate workload actions across the chaos of complex heterogeneous IT environments and management middleware to maximize workload control, automation, and optimization.

SYSTEM COMPATIBILITY
Moab works with a variety of platforms. Many commonly used resource managers, operating systems, and architectures are supported.
 Operating system support: Linux (Debian, Fedora, RedHat, SUSE), Unix (AIX, Solaris), FreeBSD
 Resource manager support: job resource managers such as TORQUE and SLURM, as well as many other types of resource managers such as HP Cluster Management Utility, Nagios, Ganglia, FlexLM, and others
 Hardware support: AMD x86, AMD Opteron, HP, Intel x86, Intel IA-32, Intel IA-64, IBM i-Series, IBM p-Series, IBM x-Series
HPC systems and enterprise grids deliver unprecedented time-to-market and performance advantages to many research and corporate customers, who struggle every day with compute- and data-intensive processes. These often generate or transform massive amounts of jobs and data that need to be handled and archived efficiently to deliver timely information to users distributed across multiple locations, with different security concerns.

Poor usability of such complex systems often negatively impacts users' productivity, and ad-hoc data management often increases information entropy and dissipates knowledge and intellectual property.
NICE ENGINFRAME
A TECHNICAL PORTAL FOR REMOTE VISUALIZATION

FIGURE 1

Solving distributed computing issues for our customers, it is easy to understand that a modern, user-friendly web front-end to HPC and grids can drastically improve engineering productivity, if properly designed to address the specific challenges of the Technical Computing market.

Nice EnginFrame overcomes many common issues in the areas of usability, data management, security and integration, opening the way to a broader, more effective use of Technical Computing resources. The key components of our solutions are:
 a flexible and modular Java-based kernel, with clear separation between customizations and core services
 powerful data management features, reflecting the typical needs of engineering applications
 comprehensive security options and a fine-grained authorization system
 a scheduler abstraction layer to adapt to different workload and resource managers
 responsive and competent support services

End users can typically enjoy the following improvements:
 user-friendly, intuitive access to computing resources, using a standard web browser
 application-centric job submission
 organized access to job information and data
 increased mobility and reduced client requirements

On the other side, the Technical Computing Portal delivers significant added value for system administrators and IT:
 reduced training needs to enable users to access the resources
 centralized configuration and immediate deployment of services
 comprehensive authorization to access services and information
 reduced support calls and submission errors

Coupled with our Remote Visualization solutions, our customers quickly deploy end-to-end engineering processes on their Intranet, Extranet or Internet.

ENGINFRAME HIGHLIGHTS
 Universal and flexible access to your Grid infrastructure: EnginFrame provides easy access over intranet, extranet or the Internet using standard and secure Internet languages and protocols.
 Interface to the Grid in your organization: EnginFrame offers pluggable server-side XML services to enable easy integration of the leading workload schedulers on the market. Plug-in modules for all major grid solutions, including Platform Computing LSF and OCS, Microsoft Windows HPC Server 2008, Sun GridEngine, UnivaUD UniCluster, IBM LoadLeveler, Altair PBS/Pro and OpenPBS, Torque, Condor, and EDG/gLite Globus-based toolkits, provide easy interfaces to the existing computing infrastructure.
 Security and Access Control: you can give encrypted and controllable access to remote users and improve collaboration with partners while protecting your infrastructure and Intellectual Property (IP). This enables you to speed up design cycles while working with related departments or external partners, and eases communication through a secure infrastructure.
 Distributed Data Management: the comprehensive remote file management built into EnginFrame avoids unnecessary file transfers, and enables server-side treatment of data as well as file transfer from/to the user's desktop.
 Interactive application support: through the integration with leading GUI virtualization solutions (including Desktop Cloud Visualization (DCV), VNC, VirtualGL and NX), EnginFrame enables you to address all stages of the design process, both batch and interactive.
 SOA-enable your applications and resources: EnginFrame can automatically publish your computing applications via standard WebServices (tested for both Java and .NET interoperability), enabling immediate integration with other enterprise applications.

BUSINESS BENEFITS

Web and Cloud Portal
 Simplify user access; minimize the training requirements for HPC users; broaden the user community
 Enable access from intranet, internet or extranet
APPLICATION HIGHLIGHTS

AEROSPACE, AUTOMOTIVE AND MANUFACTURING
The complex factors involved in CAE range from compute-intensive data analysis to worldwide collaboration between designers, engineers, OEMs and suppliers. To accommodate these factors, Cloud (both internal and external) and Grid Computing solutions are increasingly seen as a logical step to optimize IT infrastructure usage for CAE. It is no surprise, then, that automotive and aerospace manufacturers were the early adopters of internal Cloud and Grid portals.

Manufacturers can now develop more "virtual products" and simulate all types of designs, fluid flows, and crash simulations. Such virtualized products and more streamlined collaboration environments are revolutionizing the manufacturing process.

With NICE EnginFrame in their CAE environment, engineers can take the process even further by connecting design and simulation groups in "collaborative environments" to get even greater benefits from "virtual products". Thanks to EnginFrame, CAE engineers can have a simple, intuitive collaborative environment that takes care of issues related to:
 Access & Security: where an organization must give access to external and internal entities such as designers, engineers and suppliers.
 Distributed collaboration: simple and secure connection of design and simulation groups distributed worldwide.
 Time spent on IT tasks: by eliminating time and resources spent using cryptic job submission commands or acquiring knowledge of underlying compute infrastructures, engineers can spend more time concentrating on their core design tasks.

EnginFrame's web-based interface can be used to access the compute resources required for CAE processes. This means access to job submission and monitoring tasks, and to the input and output data associated with industry-standard CAE/CFD applications for Fluid Dynamics, Structural Analysis, Electro Design and Design Collaboration (like Abaqus, Ansys, Fluent, MSC Nastran, PAMCrash, LS-Dyna, Radioss), without cryptic job submission commands or knowledge of underlying compute infrastructures.

EnginFrame has a long history of usage in some of the most prestigious manufacturing organizations worldwide, including Aerospace companies like AIRBUS, Alenia Space, CIRA, Galileo Avionica, Hamilton Sunstrand, Magellan Aerospace, MTU, and Automotive companies like Audi, ARRK, Brawn GP, Bridgestone, Bosch, Delphi, Elasis, Ferrari, FIAT, GDX Automotive, Jaguar-LandRover, Lear, Magneti Marelli, McLaren, P+Z, RedBull, Swagelok, Suzuki, Toyota, TRW.

LIFE SCIENCES AND HEALTHCARE
NICE solutions are deployed in the Life Sciences sector at companies like BioLab, Partners HealthCare, Pharsight and the M.D. Anderson Cancer Center, and also in leading research projects like DEISA or LitBio, in order to allow easy and transparent use of computing resources without any insight into the HPC infrastructure.

The Life Science and Healthcare sectors have some very strict requirements when choosing an IT solution like EnginFrame, for instance:
 Security: to meet the strict security and privacy requirements of the biomedical and pharmaceutical industry, any solution needs to take account of multiple layers of security and authentication.
 Industry-specific software: ranging from the simplest custom tool to more general-purpose free and open middlewares.

EnginFrame's modular architecture allows different Grid middlewares and software (including leading Life Science applications like Schroedinger Glide, EPFL RAxML, the BLAST family, Taverna, and R) to be exploited. Users can compose elementary services into complex applications and "virtual experiments", and can run, monitor and build workflows via a standard Web browser. EnginFrame also has highly tuneable resource sharing and fine-grained access control, where multiple authentication systems (like Active Directory, Krb5 or LDAP) can be exploited simultaneously.
  • NICE ENGINE FRAME Web and Cloud Portal  “One Stop Shop” - the portal as theDESKTOP CLOUD VIRTUALIZATION starting point for knowledge and resources  Enable collaboration and sharing through the portal as a virtual meeting point (employees, part- ners, and affiliates)  Single Sign-on  Integrations for many technical computing ISV and open-source applications and infrastructure components Business Continuity  Virtualizes servers and server loca- tions, data and storage locations, applications and licenses  Move to thin desktops for graphi- cal workstation applications by moving graphics processing into the datacenter (also can increase utilization and re-use of licenses) “The amount of data resulting from e.g. simula-  Enable multi-site and global loca- tions in CAE or other engineering environments tions and services can be in the Gigabyte range. It is obvious that  Data maintained on network sto- remote post-processing is one of the most rage - not on laptop/desktop urgent topics to be tackled. NICE Engine Frame  Cloud and ASP – create, or take provides exactly that, and our customers are advantage of, Cloud and ASP impressed that such great technology enhanc- (Application Service Provider) es their workflow so significantly.” services to extend your business, grow new revenue, manage costs. Robin Kienecker HPC Sales Specialist 56
Compliance
  • Ensure secure and auditable access and use of resources (applications, data, infrastructure, licenses)
  • Enables IT to provide better uptime and service levels to users
  • Allow collaboration with partners while protecting Intellectual Property and resources
  • Restrict access by class of user, service, application, and resource

Web Services
  • Wrap legacy applications (command line, local API/SDK) with a web services SOAP/.NET/Java interface for access in your SOA environment
  • Workflow your human and automated business processes and supply chain
  • Integrate with corporate policy for Enterprise Portal or Business Process Management (works with SharePoint®, WebLogic®, WebSphere®, etc.)

DESKTOP CLOUD VIRTUALIZATION
Nice Desktop Cloud Visualization (DCV) is an advanced technology that enables Technical Computing users to remotely access 2D/3D interactive applications over a standard network. Engineers and scientists are immediately empowered by taking full advantage of high-end graphics cards, fast I/O performance and large memory nodes hosted in a "Public or Private 3D Cloud", rather than waiting for the next upgrade of their workstations. The DCV protocol adapts to heterogeneous networking infrastructures like LAN, WAN and VPN to deal with bandwidth and latency constraints. All applications run natively on the remote machines, which can be virtualized and share the same physical GPU.

In a typical visualization scenario, a software application sends a stream of graphics commands to a graphics adapter through an input/output (I/O) interface. The graphics adapter renders the data into pixels and outputs them to the local display as a video signal. When using Nice DCV, the scene geometry and graphics state are rendered on a central server, and pixels are sent to one or more remote displays. This approach requires the server to be equipped with one or more GPUs, which are used for the OpenGL rendering, while the client software can run on "thin" devices.

The Nice DCV architecture consists of:
  • a DCV Server, equipped with one or more GPUs, used for OpenGL rendering
  • one or more DCV EndStations, running on "thin clients", used only for visualization
  • heterogeneous networking infrastructures (like LAN, WAN and VPN), with optimized balancing of quality vs. frame rate

Nice DCV Highlights
  • enables high-performance remote access to interactive 2D/3D software applications on low-bandwidth/high-latency networks
  • supports multiple heterogeneous operating systems (Windows, Linux)
  • enables GPU sharing
  • supports 3D acceleration for OpenGL applications running on Virtual Machines
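The bandwidth argument behind this approach can be illustrated with a toy sketch (illustrative only, not the real DCV protocol; the model size, resolution and compression settings are hypothetical): shipping a large model's geometry grows with model size, while a compressed frame of rendered pixels is bounded by the display resolution.

```python
import zlib

# Toy sketch of the idea behind server-side rendering (NOT the actual
# DCV protocol): instead of shipping scene geometry to the client, the
# server renders the scene into pixels and streams a compressed frame
# whose size depends on the display resolution, not on the model.

WIDTH, HEIGHT = 1280, 720              # assumed remote display resolution

def scene_size_bytes(num_triangles):
    # 3 vertices per triangle, 3 floats (x, y, z) per vertex, 4 bytes each
    return num_triangles * 3 * 3 * 4

def render_frame():
    # Stand-in for server-side OpenGL rendering: a raw 24-bit RGB buffer.
    return bytes(WIDTH * HEIGHT * 3)

def encode_frame(frame):
    # The pixel stream is compressed before being sent to the client.
    return zlib.compress(frame, 1)

geometry_mb = scene_size_bytes(10_000_000) / 1e6   # a large CAE model
frame_kb = len(encode_frame(render_frame())) / 1e3
print(f"geometry transfer: {geometry_mb:.0f} MB")
print(f"one frame update:  {frame_kb:.0f} KB")
```

However large the model grows, the per-update pixel payload stays tied to the display size — which is why "thin" EndStations over WAN links remain feasible.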
REMOTE VISUALIZATION
It's human nature to want to 'see' the results from simulations, tests, and analyses. Up until recently, this has meant 'fat' workstations on many user desktops. This approach provides CPU power when the user wants it – but as dataset size increases, there can be delays in downloading the results. Also, sharing the results with colleagues means gathering around the workstation - not always possible in this globalized, collaborative workplace.

Increasing dataset complexity (millions of polygons, interacting components, MRI/PET overlays) means that as time comes to upgrade and replace the workstations, the next generation of hardware needs more memory, more graphics processing, more disk, and more CPU cores. This makes the workstation expensive, in need of cooling, and noisy.

FIGURE 2

Innovation in the field of remote 3D processing now allows companies to address these issues by moving applications away from the desktop into the data center. Instead of pushing data to the application, the application can be moved near the data. Instead of mass workstation upgrades, Remote Visualization allows incremental provisioning, on-demand allocation, better management and efficient distribution of interactive sessions
and licenses. Racked workstations or blades typically have lower maintenance, cooling and replacement costs, and they can extend workstation (or laptop) life as "thin clients".

THE SOLUTION
Leveraging their expertise in distributed computing and Web-based application portals, NICE offers an integrated solution to access, load balance and manage applications and desktop sessions running within a visualization farm. The farm can include both Linux and Windows resources, running on heterogeneous hardware.

The core of the solution is the EnginFrame Visualization plug-in, which delivers Web-based services to access and manage applications and desktops published in the farm. This solution has been integrated with:
  • NICE Desktop Cloud Visualization (DCV)
  • HP Remote Graphics Software (RGS)
  • RealVNC
  • TurboVNC and VirtualGL
  • NoMachine NX

Coupled with these third-party remote visualization engines (which specialize in delivering high frame rates for 3D graphics), the NICE offering for Remote Visualization solves the issues of user authentication, dynamic session allocation, session management and data transfers.

End users can enjoy the following improvements:
  • Intuitive, application-centric Web interface to start, control and re-connect to a session
  • Single sign-on for batch and interactive applications
  • All data transfers from and to the remote visualization farm are handled by EnginFrame
  • Built-in collaboration, to share sessions with other users
  • The load and usage of the visualization cluster is monitored in the browser

The solution also delivers significant added value for the system administrators:
  • No need for SSH / SCP / FTP on the client machine
  • Easy integration into identity services, Single Sign-On (SSO), Enterprise portals
  • Automated data life cycle management
  • Built-in user session sharing, to facilitate support
  • Interactive sessions are load balanced by a scheduler (LSF, SGE or Torque) to achieve optimal performance and resource usage
  • Better control and use of application licenses
  • Monitor, control and manage users' idle sessions

transtec has long-term experience in Engineering environments, especially in the CAD/CAE sector. This allows us to provide customers from this area with solutions that greatly enhance their workflow and minimize time-to-result. This, together with transtec's offerings of all kinds of services, allows our customers to fully focus on their productive work, and have us do the environmental optimizations.
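The scheduler-driven placement of interactive sessions can be sketched roughly as follows (a simplified illustration only — in the real solution placement is delegated to LSF, SGE or Torque, and the host names, session limits and license counts here are hypothetical):

```python
# Toy sketch of scheduler-style placement of interactive visualization
# sessions: pick the least-loaded host that still has capacity, and only
# if a floating application license is free. Illustrative, not the
# actual LSF/SGE/Torque logic used by EnginFrame.

def place_session(hosts, licenses_free):
    """hosts maps host name -> (running_sessions, max_sessions)."""
    if licenses_free <= 0:
        return None                        # no license: queue the request
    candidates = [
        (running / maximum, host)
        for host, (running, maximum) in hosts.items()
        if running < maximum               # skip fully loaded hosts
    ]
    if not candidates:
        return None                        # farm full: queue the request
    candidates.sort()                      # lowest relative load first
    return candidates[0][1]

farm = {"vis01": (3, 4), "vis02": (1, 4), "vis03": (4, 4)}
print(place_session(farm, licenses_free=2))    # → vis02
```

Routing every session request through one placement decision is what lets a farm share both GPUs and floating licenses across many users instead of tying them to individual desktops.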
  • Supports multiple user collaboration via session sharing
  • Enables attractive Return-on-Investment through resource sharing and consolidation to data centers (GPU, memory, CPU, ...)
  • Keeps the data secure in the data center, reducing data load and saving time
  • Enables right sizing of system allocation based on user's dynamic needs
  • Facilitates application deployment: all applications, updates and patches are instantly available to everyone, without any changes to original code

FIGURE 3

BUSINESS BENEFITS
The business benefits of adopting Nice DCV can be summarized into four categories:

Productivity
  • Increase business efficiency
  • Improve team performance by ensuring real-time collaboration with colleagues and partners, anywhere
  • Reduce IT management costs by consolidating workstation resources to a single point of management
  • Save money and time on application deployment
  • Let users work from anywhere there is an Internet connection
Business Continuity
  • Move graphics processing and data to the datacenter - not on laptop/desktop
  • Cloud-based platform support enables you to scale the visualization solution "on-demand" to extend business, grow new revenue, and manage costs

Data Security
  • Guarantee secure and auditable use of remote resources (applications, data, infrastructure, licenses)
  • Allow real-time collaboration with partners while protecting Intellectual Property and resources
  • Restrict access by class of user, service, application, and resource

Training Effectiveness
  • Enable multiple users to follow application procedures alongside an instructor in real-time
  • Enable collaboration and session sharing among remote users (employees, partners, and affiliates)

Supported operating systems:

Windows
  • Microsoft Windows 7 - 32/64 bit
  • Microsoft Windows Vista - 32/64 bit
  • Microsoft Windows XP - 32/64 bit
  • Microsoft Windows Server 2008 R2 (single user only)

Linux
  • RedHat Enterprise 4, 5, 5.5, 6 - 32/64 bit
  • SUSE Enterprise Server 11 - 32/64 bit

© 2012 by NICE

Nice DCV is perfectly integrated into EnginFrame Views, leveraging 2D/3D capabilities over the Web, including the ability to share an interactive session with other users for collaborative working.
Intel Cluster Ready is designed to create predictable expectations for users and providers of HPC clusters, primarily targeting customers in the commercial and industrial sectors. These are not experimental "test-bed" clusters used for computer science and computer engineering research, or high-end "capability" clusters closely targeting their specific computing requirements that power the high-energy physics at the national labs or other specialized research organizations.

Intel Cluster Ready seeks to advance HPC clusters used as computing resources in production environments by providing cluster owners with a high degree of confidence that the clusters they deploy will run the applications their scientific and engineering staff rely upon to do their jobs. It achieves this by providing cluster hardware, software, and system providers with a precisely defined basis for their products to meet their customers' production cluster requirements.
INTEL CLUSTER READY
A QUALITY STANDARD FOR HPC CLUSTERS

WHAT ARE THE OBJECTIVES OF ICR?
The primary objective of Intel Cluster Ready is to make clusters easier to specify, easier to buy, easier to deploy, and to make it easier to develop applications that run on them. A key feature of ICR is the concept of "application mobility", which is defined as the ability of a registered Intel Cluster Ready application – more correctly, the same binary – to run correctly on any certified Intel Cluster Ready cluster. Clearly, application mobility is important for users, software providers, hardware providers, and system providers.

  • Users want to know the cluster they choose will reliably run the applications they rely on today, and will rely on tomorrow
  • Application providers want to satisfy the needs of their customers by providing applications that reliably run on their customers' cluster hardware and cluster stacks
  • Cluster stack providers want to satisfy the needs of their customers by providing a cluster stack that supports their customers' applications and cluster hardware
  • Hardware providers want to satisfy the needs of their customers by providing hardware components that support their customers' applications and cluster stacks
  • System providers want to satisfy the needs of their customers by providing complete cluster implementations that reliably run their customers' applications

Without application mobility, each group above must either try to support all combinations, which they have neither the time nor resources to do, or pick the "winning combination(s)" that best supports their needs, and risk making the wrong choice.

"The Intel Cluster Checker allows us to certify that our transtec HPC clusters are compliant with an independent high quality standard. Our customers can rest assured: their applications run as they expect."
Marcus Wiedemann, HPC Solution Engineer
The Intel Cluster Ready definition of application portability supports all of these needs by going beyond pure portability (re-compiling and linking a unique binary for each platform) to application binary mobility (running the same binary on multiple platforms), by more precisely defining the target system.

A further aspect of application mobility is to ensure that registered Intel Cluster Ready applications do not need special programming or alternate binaries for different message fabrics. Intel Cluster Ready accomplishes this by providing an MPI implementation supporting multiple fabrics at runtime; through this, registered Intel Cluster Ready applications obey the "message layer independence property". Stepping back, the unifying concept of Intel Cluster Ready is "one-to-many", that is:
  • One application will run on many clusters
  • One cluster will run many applications

How is one-to-many accomplished? Looking at Figure 1, you see the abstract Intel Cluster Ready "stack" components that always exist in every cluster, i.e., one or more applications, a cluster software stack, one or more fabrics, and finally the underlying cluster hardware. The remainder of that picture (to the right) shows the components in greater detail.

FIGURE 1: ICR STACK (registered applications – CFD, crash, climate, QCD, bio, ... – on top of the Intel software stack with MPI Library and MKL runtimes, Intel-selected Linux cluster tools, and Gigabit Ethernet, InfiniBand (OFED) or 10Gbit Ethernet fabrics, all running on certified Intel Xeon processor cluster platforms)

Applications, on the top of the stack, rely upon the various APIs, utilities, and file system structure presented by the underlying software stack. Registered Intel Cluster Ready applications are always able to rely upon the APIs, utilities, and file system structure specified by the Intel Cluster Ready Specification; if an application requires software outside this "required" set, then Intel Cluster Ready requires the application to provide that software as a part of its installation. To ensure that this additional per-application software doesn't conflict with the cluster stack or other applications, Intel Cluster Ready also requires the additional software to be installed in application-private trees, so the application knows how to find that software while not interfering with other applications. While this may well cause duplicate software to be installed, the reliability provided by the duplication far outweighs the cost of the duplicated files. A prime example supporting this comparison is the removal of a common file (library, utility, or other) that is unknowingly needed by some other application – such errors can be insidious to repair even when they cause an outright application failure.

Cluster platforms, at the bottom of the stack, provide the APIs, utilities, and file system structure relied upon by registered applications. Certified Intel Cluster Ready platforms ensure
the APIs, utilities, and file system structure are complete per the Intel Cluster Ready Specification; certified clusters are able to provide them by various means as they deem appropriate. Because of the clearly defined responsibilities ensuring the presence of all software required by registered applications, system providers have a high confidence that the certified clusters they build are able to run any certified applications their customers rely on. In addition to meeting the Intel Cluster Ready requirements, certified clusters can also provide their added value, that is, other features and capabilities that increase the value of their products.

HOW DOES INTEL CLUSTER READY ACCOMPLISH ITS OBJECTIVES?
At its heart, Intel Cluster Ready is a definition of the cluster as a parallel application platform, as well as a tool to certify an actual cluster to the definition. Let's look at each of these in more detail, to understand their motivations and benefits.

A definition of the cluster as parallel application platform
The Intel Cluster Ready Specification is very much written as the requirements for, not the implementation of, a platform upon which parallel applications, more specifically MPI applications, can be built and run. As such, the specification doesn't care whether the cluster is diskful or diskless, fully distributed or single system image (SSI), built from "Enterprise" distributions or community distributions, fully open source or not. Perhaps more importantly, with one exception, the specification doesn't have any requirements on how the cluster is built; that one exception is that compute nodes must be built with automated tools, so that new, repaired, or replaced nodes can be rebuilt identically to the existing nodes without any manual interaction, other than possibly initiating the build process.
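The "required set" and application-private-tree rules described earlier can be illustrated with a small sketch (the library names, the contents of the required set, and the install paths below are purely hypothetical — the actual required set is defined by the Intel Cluster Ready Specification):

```python
# Sketch of the ICR packaging rule: anything an application needs beyond
# the guaranteed "required set" must ship with the application itself,
# installed under an application-private tree so it cannot clash with
# the cluster stack or with other applications. All names illustrative.

REQUIRED_SET = {"libc", "libm", "libmpi", "libmkl"}   # assumed base set

def private_tree_contents(app_name, dependencies):
    """Return the dependencies the application must bundle privately."""
    extras = sorted(set(dependencies) - REQUIRED_SET)
    # Installing under a per-application prefix keeps duplicates isolated.
    return {dep: f"/opt/{app_name}/lib/{dep}.so" for dep in extras}

bundle = private_tree_contents("cfdsolver", ["libc", "libmpi", "libfftw", "libhdf5"])
for dep, path in bundle.items():
    print(dep, "->", path)
```

The duplication this produces is deliberate: as the text argues, a private copy per application is cheaper than debugging the removal of a shared file another application silently depended on.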
Some items the specification does care about include:
  • The ability to run both 32- and 64-bit applications, including MPI applications and X-clients, on any of the compute nodes
  • Consistency among the compute nodes' configuration, capability, and performance
  • The identical accessibility of libraries and tools across the cluster
  • The identical access by each compute node to permanent and temporary storage, as well as users' data
  • The identical access to each compute node from the head node
  • The MPI implementation provides fabric independence
  • All nodes support network booting and provide a remotely accessible console

The specification also requires that the runtimes for specific Intel software products are installed on every certified cluster:
  • Intel Math Kernel Library
  • Intel MPI Library Runtime Environment
  • Intel Threading Building Blocks

This requirement does two things. First and foremost, mainline Linux distributions do not necessarily provide a sufficient software stack to build an HPC cluster – such specialization is beyond their mission. Secondly, the requirement ensures that programs built with this software will always work on certified clusters and enjoy simpler installations. As these runtimes are directly available from the web, the requirement does not cause additional costs to certified clusters. It is also very important to note that this does not require certified applications to use these libraries, nor does it preclude alternate libraries, e.g., other MPI implementations, from being present on certified clusters. Quite clearly, an application that requires, e.g., an alternate MPI must also provide the runtimes for that MPI as a part of its installation.

A tool to certify an actual cluster to the definition
The Intel Cluster Checker, included with every certified Intel Cluster Ready implementation, is used in four modes in the life of a cluster:
  • To certify a system provider's prototype cluster as a valid implementation of the specification
  • To verify to the owner that the just-delivered cluster is a "true copy" of the certified prototype
  • To ensure the cluster remains fully functional, reducing service calls not related to the applications or the hardware
  • To help software and system providers diagnose and correct actual problems to their code or their hardware

While these are critical capabilities, in all fairness, this greatly understates the capabilities of Intel Cluster Checker. The tool will not only verify the cluster is performing as expected. To do this, per-node and cluster-wide static and dynamic tests are made of the hardware and software.

FIGURE 2: INTEL CLUSTER CHECKER (the Cluster Checker engine reads a cluster definition and configuration XML file, drives test modules and parallel operations checks across the nodes, and reports pass/fail results and diagnostics to STDOUT and a logfile)
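The flavor of such cluster-wide consistency tests can be sketched as follows (an illustrative toy, not the actual Intel Cluster Checker; the node names, BIOS versions and bandwidth figures are made up):

```python
from statistics import median

# Toy sketch of cluster-wide consistency testing in the spirit of the
# Intel Cluster Checker: every node reports a value (BIOS version,
# STREAM triad bandwidth, ...) and nodes that disagree are flagged.

def check_identical(values):
    """Static check: e.g. BIOS versions must match on every node."""
    reference = next(iter(values.values()))
    return [node for node, v in values.items() if v != reference]

def check_within(values, tolerance=0.10):
    """Dynamic check: e.g. STREAM bandwidth within 10% of the median."""
    mid = median(values.values())
    return [node for node, v in values.items()
            if abs(v - mid) / mid > tolerance]

bios = {"n1": "2.1b", "n2": "2.1b", "n3": "2.0a"}
stream = {"n1": 41.8, "n2": 42.1, "n3": 30.5}    # GB/s triad results
print("BIOS mismatch:", check_identical(bios))    # → ['n3']
print("slow nodes:   ", check_within(stream))     # → ['n3']
```

The point of comparing against the cluster's own median rather than an absolute number is that a single inconsistent node is what drags down tightly coupled parallel jobs, exactly as the following discussion of BIOS and STREAM checks explains.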
Intel, XEON and certain other trademarks and logos appearing in this brochure are trademarks or registered trademarks of Intel Corporation.

The static checks ensure the systems are configured consistently and appropriately. As one example, the tool will ensure the systems are all running the same BIOS versions as well as having identical configurations among key BIOS settings. This type of problem – differing BIOS versions or settings – can be the root cause of subtle problems such as differing memory configurations that manifest themselves as differing memory bandwidths, only to be seen at the application level as slower than expected overall performance. As is well known, parallel program performance can be very much governed by the performance of the slowest components, not the fastest.

In another static check, the Intel Cluster Checker will ensure that the expected tools, libraries, and files are present on each node, identically located on all nodes, as well as identically implemented on all nodes. This ensures that each node has the minimal software stack specified by the specification, as well as an identical software stack among the compute nodes.

A typical dynamic check ensures consistent system performance, e.g., via the STREAM benchmark. This particular test ensures processor and memory performance is consistent across compute nodes, which, like the BIOS setting example above, can be the root cause of overall slower application performance. An additional check with STREAM can be made if the user configures an expectation of benchmark performance; this check will ensure that performance is not only consistent across the cluster, but also meets expectations. Going beyond processor performance, the Intel MPI Benchmarks are used to ensure the network fabric(s) are performing properly and, with a configuration that describes expected performance levels, up to the cluster
  • provider’s performance expectations. Network inconsistenciesdue to poorly performing Ethernet NICs, InfiniBand HBAs, faultyswitches, and loose or faulty cables can be identified. Finally,the Intel Cluster Checker is extensible, enabling additional teststo be added supporting additional features and capabilities.This enables the Intel Cluster Checker to not only support theminimal requirements of the Intel Cluster Ready Specification,but the full cluster as delivered to the customer.FIGURE 3 INTEL CLUSTER READY PROGRAM INTEL CLUSTER READY BUILDS HPC MOMENTUM With the Intel Cluster Ready (ICR) program, Intel Corporation Software Tools Cluster Platform set out to create a win-win scenario for the major constituen- cies in the high-performance computing (HPC) cluster market. Reference Designs Intel© Processor Hardware vendors and independent software vendors (ISVs) Server Platform Configuration Interconnect Demand Creation stand to win by being able to ensure both buyers and users Software that their products will work well together straight out of Specification the box. System administrators stand to win by being able to ISV Enabling meet corporate demands to push HPC competitive advantages deeper into their organizations while satisfying end users’ Support & Training demands for reliable HPC cycles, all without increasing IT staff. End users stand to win by being able to get their work doneConforming hardware and software faster, with less downtime, on certified cluster platforms. LastThe preceding was primarily related to the builders of certi- but not least, with ICR, Intel has positioned itself to win byfied clusters and the developers of registered applications. 
expanding the total addressable market (TAM) and reducingFor end users that want to purchase a certified cluster to run time to market for the company’s microprocessors, chip sets,registered applications, the ability to identify registered ap- and platforms.plications and certified clusters is most important, as that willreduce their effort to evaluate, acquire, and deploy the clus- The Worst of Timesters that run their applications, and then keep that computing For a number of years, clusters were largely confined to govern-resource operating properly, with full performance, directly ment and academic sites, where contingents of graduate stu-increasing their productivity. dents and midlevel employees were available to help program 69
  • INTEL CLUSTER READY and maintain the unwieldy early systems. Commercial firms lacked this low-cost labor supply and mistrusted the favoredINTEL CLUSTER READY BUILDS HPC MOMENTUM cluster operating system, open source Linux, on the grounds that no single party could be held accountable if something went wrong with it. Today, cluster penetration in the HPC mar- ket is deep and wide, extending from systems with a handful of processors to some of the world’s largest supercomputers, and from under $25,000 to tens or hundreds of millions of dollars in price. Clusters increasingly pervade every HPC vertical market: biosciences, computer-aided engineering, chemical engineer- ing, digital content creation, economic/financial services, electronic design automation, geosciences/geo-engineering, mechanical design, defense, government labs, academia, and weather/climate. But IDC studies have consistently shown that clusters remain difficult to specify, deploy, and manage, especially for new and less experienced HPC users. This should come as no surprise, given that a cluster is a set of independent computers linked to- gether by software and networking technologies from multiple vendors. Clusters originated as do-it-yourself HPC systems. In the late 1990s users began employing inexpensive hardware to cobble together scientific computing systems based on the “Beowulf cluster” concept first developed by Thomas Sterling and Don- ald Becker at NASA. From their Beowulf origins, clusters have evolved and matured substantially, but the system manage- ment issues that plagued their early years remain in force today. 70
The Need for Standard Cluster Solutions
The escalating complexity of HPC clusters poses a dilemma for many large IT departments that cannot afford to scale up their HPC-knowledgeable staff to meet the fast-growing end-user demand for technical computing resources. Cluster management is even more problematic for smaller organizations and business units that often have no dedicated, HPC-knowledgeable staff to begin with.

The ICR program aims to address burgeoning cluster complexity by making available a standard solution (aka reference architecture) for Intel-based systems that hardware vendors can use to certify their configurations and that ISVs and other software vendors can use to test and register their applications, system software, and HPC management software. The chief goal of this voluntary compliance program is to ensure fundamental hardware-software integration and interoperability so that system administrators and end users can confidently purchase and deploy HPC clusters, and get their work done, even in cases where no HPC-knowledgeable staff are available to help.

The ICR program wants to prevent end users from having to become, in effect, their own systems integrators. In smaller organizations, the ICR program is designed to allow overworked IT departments with limited or no HPC expertise to support HPC user requirements more readily. For larger organizations with dedicated HPC staff, ICR creates confidence that required user applications will work, eases the problem of system administration, and allows HPC cluster systems to be scaled up in size without scaling support staff. ICR can help drive HPC cluster resources deeper into larger organizations and free up IT staff to focus on mainstream enterprise applications (e.g., payroll, sales, HR, and CRM).

The program is a three-way collaboration among hardware vendors, software vendors, and Intel. In this triple alliance, Intel provides the specification for the cluster architecture implementation, and then vendors certify the hardware configurations and register software applications as compliant with the specification. The ICR program's promise to system administrators and end users is that registered applications will run out of the box on certified hardware configurations.

ICR solutions are compliant with the standard platform architecture, which starts with 64-bit Intel Xeon processors in an Intel-certified cluster hardware platform. Layered on top of this foundation are the interconnect fabric (Gigabit Ethernet, InfiniBand) and the software stack: Intel-selected Linux cluster tools, an Intel MPI runtime library, and the Intel Math Kernel Library. Intel runtime components are available and verified as part of the certification (e.g., Intel tool runtimes) but are not required to be used by applications. The inclusion of these Intel runtime components does not exclude any other components a systems vendor or ISV might want to use. At the top of the stack are Intel-registered ISV applications.

At the heart of the program is the Intel Cluster Checker, a validation tool that verifies that a cluster is specification compliant and operational before ISV applications are ever loaded. After the cluster is up and running, the Cluster Checker can function
as a fault isolation tool in wellness mode. Certification needs to happen only once for each distinct hardware platform, while verification – which determines whether a valid copy of the specification is operating – can be performed by the Cluster Checker at any time. Cluster Checker is an evolving tool that is designed to accept new test modules. It is a productized tool that ICR members ship with their systems. Cluster Checker originally was designed for homogeneous clusters but can now also be applied to clusters with specialized nodes, such as all-storage sub-clusters. Cluster Checker can isolate a wide range of problems, including network or communication problems.
transtec offers their customers a new and fascinating way to evaluate transtec's HPC solutions in real-world scenarios. With the transtec Benchmarking Center, solutions can be explored in detail with the actual applications the customers will later run on them. Intel Cluster Ready makes this feasible by simplifying the maintenance of the systems and the set-up of clean systems, very easily and as often as needed. As High-Performance Computing (HPC) systems are utilized for numerical simulations, more and more advanced clustering technologies are being deployed. Because of its performance, price/performance and energy efficiency advantages, clusters now dominate all segments of the HPC market and continue to gain acceptance. HPC computer systems have become far more widespread and pervasive in government, industry, and academia. However, rarely does the client have the possibility to test their actual application on the system they are planning to acquire.

THE TRANSTEC BENCHMARKING CENTER
transtec HPC solutions get used by a wide variety of clients. Among those are most of the large users of compute power at German and other European universities and research centers, as well as governmental users like the German army's compute center, and clients from the high tech, the automotive and other sectors. transtec HPC solutions have demonstrated their value in more than 500 installations. Most of transtec's cluster systems are based on SUSE Linux Enterprise Server, Red Hat Enterprise Linux, CentOS, or Scientific Linux. With xCAT for efficient cluster deployment, and Moab HPC Suite by Adaptive Computing for high-level cluster management, transtec is able to efficiently deploy and ship easy-to-use HPC cluster solutions with enterprise-class management features. Moab has proven to provide easy-to-use workload and job management for small systems as well as the largest cluster installations worldwide. However, when selling clusters to governmental customers as well as other large enterprises, it is often required that the client can choose from a range of competing offers. Many times there is a fixed budget available, and competing solutions are compared based on their performance towards certain custom benchmark codes.

So, in 2007 transtec decided to add another layer to their already wide array of competence in HPC – ranging from cluster deployment and management and the latest CPU, board and network technology to HPC storage systems. In transtec's HPC Lab the systems are being assembled. transtec is using Intel Cluster Ready to facilitate testing, verification, documentation, and final testing throughout the actual build process. At the Benchmarking Center transtec can now offer a set of small clusters with the "newest and hottest technology" through Intel Cluster Ready. A standard installation infrastructure gives transtec a quick and easy way to set systems up according to their customers' choice of operating system, compilers, workload management suite, and so on. With Intel Cluster Ready there are prepared standard set-ups available with verified performance at standard benchmarks, while the system stability is guaranteed by our own test suite and the Intel Cluster Checker.

The Intel Cluster Ready program is designed to provide a common standard for HPC clusters, helping organizations design and build seamless, compatible and consistent cluster configurations. Integrating the standards and tools provided by this program can help significantly simplify the deployment and management of HPC clusters.
Windows HPC Server 2008 R2 is the third version of the Microsoft solution for high performance computing (HPC). Built on Windows Server 2008 R2 64-bit technology, Windows HPC Server 2008 R2 efficiently scales to thousands of nodes and integrates seamlessly with Windows-based IT infrastructures, providing a powerful combination of ease of use, low ownership costs, and performance.

Compared to the previous version, Windows HPC Server 2008 R2 delivers significant improvements in several areas. Through these enhancements, Windows HPC Server 2008 R2 makes it easier than ever for companies to benefit from high-performance computing. System administrators can more easily deploy and manage powerful HPC solutions, developers can more easily build applications, and end users can more easily access those solutions from their Windows-based desktops.
WINDOWS HPC SERVER 2008 R2
ELEMENTS OF THE MICROSOFT HPC SOLUTION
Windows HPC Server 2008 R2 combines the underlying stability and security of Windows Server 2008 R2 with the features of Microsoft HPC Pack 2008 R2 to provide a robust, scalable, cost-effective, and easy-to-use HPC solution. A basic Windows HPC Server 2008 R2 solution is composed of a cluster of servers, with a single head node (or a primary and backup head node in a highly available configuration) and one or more compute nodes (see Figure 1). The head node controls and mediates all access to the cluster resources and is the single point of management, deployment, and job scheduling for the cluster.

FIGURE 1: WINDOWS HPC CLUSTER SETUP (the enterprise network connects workstations, Active Directory, file, mail and System Center servers to the head node – optionally a failover cluster with SQL Server and a WCF broker node – which reaches the compute nodes over private and application networks)

Windows HPC Server 2008 R2 can integrate with an existing Active Directory infrastructure for security and account management, and can use Microsoft System Center Operations Manager for data center monitoring. Windows HPC Server 2008 R2 uses Microsoft SQL Server 2008 as a data repository for the head node. Windows HPC Server 2008 R2 can take advantage of the failover clustering capabilities provided in Windows Server 2008 R2 Enterprise and some editions of Microsoft SQL Server to provide high-availability failover clustering for the head node. With clustering, in the event of a head node failure, the Job Scheduler will automatically – or manually, if desired – fail over to a second server. Job Scheduler clients see no change in the head node during the failover and fail-back processes, helping to ensure uninterrupted cluster operation.
Windows HPC Server 2008 R2 adds support for a remote head node database, enabling organizations to take advantage of an existing enterprise database. 76
Feature overview (implementation and benefits):

Operating system
  Implementation: Windows Server 2008 and/or Windows Server 2008 R2 (head node is R2 only; compute nodes can be either)
  Benefits: Inherits security and stability features from Windows Server 2008 and Windows Server 2008 R2.

Processor type
  Implementation: x64 (AMD64 or Intel EM64T)
  Benefits: Large memory model and processor efficiencies of the x64 architecture.

Node deployment
  Implementation: Windows Deployment Services
  Benefits: Image-based deployment, with full support for multicasting and diskless boot.

Head node redundancy
  Implementation: Windows Failover Clustering and SQL Server Failover Clustering
  Benefits: Provides a fully redundant head node and scheduler (requires Windows Server 2008 R2 Enterprise and SQL Server Standard Edition).

Management
  Implementation: Integrated Administration Console
  Benefits: Provides a single user interface for all aspects of node and job management, grouping, monitoring, diagnostics, and reporting.

Network topology
  Implementation: Network Configuration Wizard
  Benefits: Fully automated Network Configuration Wizard for configuring the desired network topology.

Application network
  Implementation: MS-MPI
  Benefits: High-speed application network stack using NetworkDirect. Shared memory implementation for multicore processors. Highly compatible with existing MPICH2 implementations.

Scheduler
  Implementation: Job Manager Console
  Benefits: GUI is integrated into the Administration Console or can be used standalone. Command line interface supports Windows PowerShell scripting and legacy command-line scripts from Windows Compute Cluster Server. Greatly improved speed and scalability. Support for SOA applications.

Monitoring
  Implementation: Integrated into Administration Console
  Benefits: New heat map provides an at-a-glance view of cluster performance and status for up to 1,000 nodes.

Reporting
  Implementation: Integrated into Administration Console
  Benefits: Standard, prebuilt reports and historical performance charts. Additional reports can be created using SQL Server Analysis Services.

Diagnostics
  Implementation: Integrated into Administration Console
  Benefits: Out-of-the-box verification and performance tests, with the ability to store, filter, and view test results and history. An extensible diagnostic framework for creating custom diagnostics and reports.

Parallel runtime
  Implementation: Enterprise-ready SOA infrastructure
  Benefits: Windows HPC Server 2008 R2 provides enhanced support for SOA workloads, helping organizations more easily build interactive HPC applications, make them more resilient to failure, and more easily manage those applications.
DEPLOYMENT, SYSTEM MANAGEMENT, AND MONITORING

DEPLOYMENT
One challenge to the adoption of HPC solutions lies in the deployment of large clusters. With a design goal of supporting the deployment of 1,000 nodes in less than an hour, Windows HPC Server 2008 R2 builds on the capabilities provided by the Windows Deployment Services transport to simplify and streamline the deployment and updating of cluster nodes, using Windows Imaging Format (WIM) files and multicasting to rapidly deploy compute nodes in parallel. Graphical deployment tools are integrated into the Administration Console, including Node Templates for easily defining the configuration of compute nodes. New features in Windows HPC Server 2008 R2 – such as support for Windows Server 2008-based compute nodes, mixed-version clusters, and diskless boot – provide additional flexibility, enabling organizations to easily deploy solutions that are optimized to meet their needs.

Node Templates in Windows HPC Server 2008 R2 provide an easy way to define the desired configuration of compute nodes, with each Node Template including the base operating system image, drivers, configuration parameters, and, if desired, additional software. A Node Template Generation Wizard guides the administrator through the process of creating Node Templates, including support for injecting drivers into images. An improved Template Editor provides advanced configuration capabilities, including configuring Node Templates for automatic application deployment. Windows HPC Server 2008 R2 supports the deployment of compute nodes and broker nodes based on Windows Server 2008 or Windows Server 2008 R2, including mixed-version clusters.

"The performance of transtec HPC systems combined with the usability of Windows HPC Server 2008 R2 provides our customers with HPC solutions that are unrivalled in power as well as ease of management."
Robin Kienecker, HPC Sales Specialist
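Conceptually, a Node Template is just a recipe – a base image plus drivers, parameters, and optional software – applied uniformly to a set of nodes. A minimal sketch of that idea in Python (the field names here are illustrative, not the actual HPC Pack schema):

```python
# Sketch: a node template as plain data, applied to a list of nodes.
# Field names are illustrative only, not the real HPC Pack schema.

def make_template(image, drivers=(), software=()):
    return {"image": image, "drivers": list(drivers), "software": list(software)}

def apply_template(template, nodes):
    """Return a per-node deployment plan derived from one template."""
    return {node: {"image": template["image"],
                   "install": template["drivers"] + template["software"]}
            for node in nodes}

plan = apply_template(
    make_template("ws2008r2.wim", drivers=["ib_driver"], software=["app.msi"]),
    ["node001", "node002"])
```

The point of the template abstraction is exactly this: one definition, many identical nodes.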
Diskless booting of compute nodes, a new feature in Windows HPC Server 2008 R2, is provided through support for iSCSI boot from a storage array. This mechanism uses DHCP reservations for mapping to disk and leverages the storage vendor's mechanism for creating differencing disks for compute nodes. The Administration Console includes diagnostic tests that can be used post-deployment to detect common problems, monitor node loading, and view job status across the cluster. In addition, the new "Lizard" (LINPACK Wizard) in Windows HPC Server 2008 R2 enables administrators to heavily load the cluster – thereby providing an efficient mechanism for detecting issues related to configuration and deployment, networking, power, cooling, and so on.

SYSTEM MANAGEMENT
Another major challenge that organizations can face is the management and administration of HPC clusters. This has traditionally been a departmental or organizational-level challenge, requiring one or more dedicated IT professionals to manage and deploy nodes. At the same time, users submitting batch jobs are competing for limited HPC resources. Windows HPC Server 2008 R2 is designed to facilitate ease-of-management. It provides a graphical Administration Console that puts all the tools required for system management at an administrator's fingertips, including the ability to easily drill down on node details such as metrics, logs, and configuration status. Support for the Windows PowerShell scripting language facilitates the automation of system administration tasks. An enhanced "heat map" view provides system administrators with an "at a glance" view of the cluster status, including the ability to define tabs with different views of system health and resource usage. Other new management-related features in Windows HPC Server 2008 R2 include additional criteria for filtering views, support for location-based node grouping, a richer reporting database for building custom reports, and an extensible diagnostic framework.

FIGURE 2 THE ADMINISTRATION CONSOLE

MONITORING, REPORTING, AND DIAGNOSTICS
The Node Management pane within the Administration Console is used to monitor node status and initiate node-specific actions. New node management-related features in Windows HPC Server 2008 R2 include an enhanced heat map with overlay view, additional filtering criteria, customizable tabs, and location-based node grouping. In Windows HPC Server 2008 R2, the heat map has been enhanced to provide an at-a-glance view of system health and performance for clusters upwards of 1,000 nodes. System administrators can define and prioritize up to three metrics (as well as minimum and maximum thresholds for each metric) to build customized views of cluster health and status.

FIGURE 3 THE HEAT MAP VIEW GIVES INSTANT FEEDBACK ON THE HEALTH OF THE CLUSTER

Windows HPC Server 2008 R2 provides a set of prebuilt reports and charts to help system administrators understand system status, usage, and performance. Accessed through the Reports and Charts tab on the Administrator Console, these prebuilt reports span four main categories: node availability, job resource usage, job throughput, and job turnaround. Windows HPC Server 2008 R2 also provides a set of prebuilt diagnostic reports to help system administrators verify that their clusters are working properly, along with a systematic way of running the tests and storing and viewing results. This significantly improves an administrator's experience in verifying deployment, troubleshooting failures, and detecting performance degradation. Cluster administrators can view a list of these diagnostic tests, run them, change diagnostic parameters at runtime, and view the results using the Diagnostics tab in the Administration Console or by using Windows PowerShell commands.

FIGURE 4 DIAGNOSTIC PANE

JOB SCHEDULING
The Job Scheduler queues jobs and their associated tasks, allocates resources to the jobs, initiates the tasks on the compute nodes, and monitors the status of jobs and tasks. In Windows HPC Server 2008 R2, the Job Scheduler has been enhanced to support larger clusters, more jobs, and larger jobs – including improved scheduling and task throughput at scale. It includes new policies for greater flexibility and resource utilization, and is built to address both traditional batch jobs as well as newer service-oriented applications.
The Job Scheduler supports both command-line and graphical interfaces. The graphical interface is provided through the Job Scheduling tab of the Administration Console or through the HPC Job Manager, a graphical interface for use by end-users submitting and managing jobs. Other supported interfaces include:

- Command line (cmd.exe)
- Windows PowerShell 2.0
- COM and .NET application programming interfaces to support a variety of languages, including VBScript, Perl, Fortran, C/C++, C#, and Java
- The Open Grid Forum's HPC Basic Profile Web Services Interface, which supports job submission and monitoring from many platforms and languages

The Windows HPC Server 2008 R2 interfaces are fully backwards compatible, allowing job submission and management from Microsoft Compute Cluster Server and Windows HPC Server 2008 interfaces. Windows HPC Server 2008 R2 also provides a new user interface for showing job progress, and an enhanced API that enables developers to report more detailed job progress status to their HPC applications.

Scheduling policies determine how resources are allocated to jobs. Windows HPC Server 2008 R2 provides the ability to switch between traditional first-come, first-serve scheduling and a new service-balanced scheduling policy designed for SOA/dynamic (grid) workloads, with support for preemption, heterogeneous matchmaking (targeting of jobs to specific types of nodes), growing and shrinking of jobs, backfill, exclusive scheduling, and task dependencies for creating workflows.

Windows HPC Server 2008 R2 also introduces prep and release tasks – tasks that are run before and after a job. Prep tasks are guaranteed to run once on each node in a job before any other tasks, as may be required to support setup or validation of a node before the job is run. Release tasks are guaranteed to run once on each node in a job after all other tasks, as may be required to clean up or transfer files after the job.
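The ordering guarantees above – prep tasks once per node before everything, release tasks once per node after everything, and dependency-respecting ordering for the regular tasks – can be sketched as follows (a simplified model, not the actual Job Scheduler API):

```python
# Sketch: the order in which one node would execute a job's tasks.
# Prep tasks run first, release tasks run last, and regular tasks are
# ordered so that every task runs after its dependencies.
from graphlib import TopologicalSorter

def order_tasks(prep, tasks, deps, release):
    """deps maps a task name to the set of tasks it depends on."""
    graph = {t: deps.get(t, set()) for t in tasks}
    middle = list(TopologicalSorter(graph).static_order())
    return list(prep) + middle + list(release)

order = order_tasks(prep=["stage_input"],
                    tasks=["a", "b", "c"],
                    deps={"c": {"a", "b"}},   # c is a workflow step after a and b
                    release=["collect_output"])
```

Here "stage_input" always comes first and "collect_output" always comes last, exactly the prep/release guarantee described above.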
SERVICE-ORIENTED ARCHITECTURE

With the number and size of problems being tackled on ever-larger clusters continuing to grow, organizations face increased challenges in developing HPC applications. Not only must these applications be built quickly, but they must run efficiently and be managed in a way that optimizes application performance, reliability, and resource utilization.

One approach to meeting these challenges is a service-oriented architecture (SOA) – an approach to building distributed, loosely coupled applications in which functions are separated into distinct services that can be distributed over a network, combined, and reused. Windows HPC Server 2008 R2 provides enhanced support for SOA workloads, helping organizations more easily build interactive HPC applications, make them more resilient to failure, and more easily manage those applications – capabilities that open the door to new application scenarios in areas such as financial trading and risk management.

When SOA Can Be Useful – and How it Works on a Cluster
HPC applications submitted to compute clusters are typically classified as either message intensive or embarrassingly parallel. While message-intensive applications comprise sequential
tasks, embarrassingly parallel problems can be easily divided into large numbers of parallel tasks, with no dependency or communication between them. To solve these embarrassingly parallel problems without having to write low-level code, developers need to encapsulate core calculations as software modules. An SOA approach to development makes this encapsulation not only possible but easy, effectively hiding the details of data serialization and distributed computing.

With Windows HPC Server 2008 R2, tasks can run interactively as SOA applications. For interactive SOA applications, in addition to a head node and one or more compute nodes, the cluster also includes one or more Windows Communication Foundation broker nodes. The broker nodes act as intermediaries between the client application and the Windows Communication Foundation hosts running on compute nodes, load-balancing the client application's requests and returning the results to it.

Building SOA-Based HPC Applications
One attractive aspect of SOA applications is the ability to develop them quickly, without having to write a lot of low-level code. To achieve this, developers need to be able to easily encapsulate core calculations as software modules that can be deployed and run on the cluster. These software modules identify and marshal the data required for each calculation and optimize performance by minimizing the data movement and communication overhead.

Microsoft Visual Studio provides easy-to-use Windows Communication Foundation service templates and service referencing utilities to help software developers quickly prototype, debug, and unit-test SOA applications, with Windows Communication Foundation effectively hiding the complexity of data serialization and distributed computing.

New SOA-related capabilities in Windows HPC Server 2008 R2 include:

- Fire-and-Recollect Programming Model: A fire-and-recollect programming model – sometimes called fire-and-forget – is a common approach to building long-running SOA applications. The SOA runtime in Windows HPC Server 2008 R2 adds support for fire-and-recollect programming, enabling developers to implement reattachable sessions by decoupling requests and responses.
- Durable Sessions: Another new feature in Windows HPC Server 2008 R2 is the ability to implement durable sessions, where the SOA runtime persists requests and their corresponding responses on behalf of the client.
- Finalization Hooks: The SOA runtime in Windows HPC Server 2008 R2 also adds support for finalization hooks, enabling developers to add logic to perform cleanup before a service exits.
- Improved Java Interoperability: With Java sample code provided in the Windows HPC Server 2008 R2 Software Development Kit (SDK), developers can more easily write Java-based client applications that communicate with .NET services – and enjoy the same level of functionality provided with clients based on the .NET Framework and Windows Communication Foundation.

Running SOA-Based HPC Applications
In addition to developing SOA applications quickly, organizations must be able to run those applications efficiently, securely, and reliably. The SOA runtime in Windows HPC Server 2008 R2 helps organizations meet those needs through features such as low-latency round-trips for efficiently distributing short
calculation requests, end-to-end Kerberos authentication with Windows Communication Foundation transport-level security, and dynamic allocation of resources to service instances. Windows HPC Server 2008 R2 also provides several new features to help organizations more reliably run their SOA applications, including support for broker restart/failover and message persistence.

- Message Resilience: In the case of a temporary broker node failure or a catastrophic failure of the cluster, the SOA broker nodes will persist calculation requests and results. The session can continue without lost requests or results after the cluster recovers and the broker nodes are restarted.
- High-Availability Broker Nodes (Broker Restart/Failover): Furthermore, the SOA runtime in Windows HPC Server 2008 R2 adds support for automated broker failover, enabling organizations to preserve computation results in the event of a failure – an essential requirement for nonstop processing of mission-critical applications. Configured using Microsoft Message Queuing (MSMQ) on remote storage and failover broker nodes, the cluster will migrate active sessions on failed broker nodes to healthy ones, thereby enabling nonstop processing.

FIGURE 5 NETWORKDIRECT ARCHITECTURE: MS-MPI can use either the Windows Sockets path (Winsock + WSD over TCP/Ethernet networking) or the NetworkDirect provider, which bypasses the kernel TCP/IP stack to reach RDMA networking hardware directly.
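The fire-and-recollect and durable-session behavior described in this section can be modeled as a small request/response store, keyed by session ID, that persists results until the client collects them. This is a toy illustration of the concept only, not the actual SOA runtime API:

```python
# Toy model of a durable, reattachable SOA session: requests and
# responses are persisted keyed by session id, so a client can submit,
# disconnect, and recollect the results later. Illustrative only.
_store = {}  # session_id -> {"requests": [...], "responses": [...]}

def submit(session_id, request):
    s = _store.setdefault(session_id, {"requests": [], "responses": []})
    s["requests"].append(request)

def process_all(session_id, service):
    """Stand-in for broker nodes working through pending requests."""
    s = _store[session_id]
    while len(s["responses"]) < len(s["requests"]):
        s["responses"].append(service(s["requests"][len(s["responses"])]))

def recollect(session_id):
    return list(_store[session_id]["responses"])

submit("s1", 3); submit("s1", 4)     # fire...
process_all("s1", lambda x: x * x)   # ...the service runs meanwhile...
results = recollect("s1")            # ...and recollect later: [9, 16]
```

Because the store survives the client detaching, the session is "durable"; restarting the broker side simply resumes `process_all` from the persisted state.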
NETWORKING AND MPI

Windows HPC Server 2008 R2 uses the Microsoft Message Passing Interface (MS-MPI), a portable, flexible, interconnect-independent API for messaging within and between HPC nodes. MS-MPI is based on the Argonne National Laboratory open-source MPICH2 implementation, and is compatible with the MPI2 standard.

MS-MPI can run over Gigabit Ethernet, 10 Gigabit Ethernet, and high-performance networking hardware such as InfiniBand, iWARP Ethernet, and Myrinet – or any other type of interconnect that provides a WinSock Direct, NetworkDirect, or TCP/IP interface. MS-MPI includes application support (bindings) for the C, Fortran77, and Fortran90 programming languages. With Windows HPC Server 2008 R2, organizations also can take advantage of new interconnect options, such as support for RDMA over Ethernet (iWARP) from Intel and new RDMA over InfiniBand QDR (40 Gbps) hardware.

MS-MPI is optimized for shared memory communication to benefit the multicore systems prevalent in today's HPC clusters. MS-MPI in Windows HPC Server 2008 R2 introduces optimization of shared memory implementations for new Intel "Nehalem"-based processors, with internal testing by Microsoft showing up to a 20 to 30 percent performance improvement on typical commercial HPC applications.

NetworkDirect
MS-MPI can take advantage of NetworkDirect – a remote direct memory access (RDMA)-based interface – for superior networking performance and CPU efficiency. As shown in Figure 5, NetworkDirect uses a more direct path from MPI applications to networking hardware, resulting in very fast and efficient networking. Speeds and latencies are similar to those of custom, hardware-native interfaces from hardware providers.

Easier Troubleshooting of MPI Applications
MS-MPI integrates with Event Tracing for Windows to facilitate performance tuning, providing a time-synchronized log for debugging MPI system and application events across multiple computers running in parallel. In addition, Microsoft Visual Studio 2008 includes an MPI Cluster Debugger that works with MS-MPI. Developers can launch their MPI applications on multiple compute nodes from within the Visual Studio environment, and Visual Studio will automatically connect to the processes on each node, enabling developers to individually pause and examine program variables on each node.

Tuning Wizard for LINPACK ("Lizard")
A new feature for Windows HPC Server 2008 R2, the Tuning Wizard for LINPACK ("Lizard") is a pushbutton, standalone executable that enables administrators to easily measure computational performance and efficiency for an HPC cluster. Furthermore,
because it heavily loads the cluster, the Lizard can be a valuable tool for break-in and for detecting issues related to configuration and deployment, networking, power, cooling, and so on.

The Lizard calculates the performance and efficiency of an HPC cluster by automatically running the LINPACK benchmark several times, analyzing the results of each run and automatically adjusting the parameters used for the subsequent LINPACK run. Eventually, the Lizard determines the parameters that provide optimal LINPACK performance, which is measured in terms of billions of floating-point operations per second (GFLOPS) and the percentage efficiency achieved at peak performance. After running the Lizard, administrators can review the LINPACK results and save both the results and the parameters that were used to achieve them to a file.

Administrators can run the Lizard either in express tuning mode or in advanced tuning mode. In express tuning mode, the Lizard starts the tuning process immediately, using default values for LINPACK parameters. In advanced tuning mode, administrators can provide specific values to use when the tuning process starts, and can also configure how the tuning process is run.

Microsoft, Windows, Windows Vista, Windows Server, Visual Studio, Excel, Office, Visual Basic, DirectX, Direct3D, Windows PowerShell and certain other trademarks and logos appearing in this brochure are trademarks or registered trademarks of Microsoft Corporation.
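The two numbers the Lizard reports are related by simple arithmetic: efficiency is measured GFLOPS divided by the cluster's theoretical peak (nodes x cores x clock x FLOPs per cycle). A sketch with made-up cluster figures:

```python
# Sketch: LINPACK efficiency = measured GFLOPS / theoretical peak.
# All cluster numbers below are hypothetical, for illustration only.

def peak_gflops(nodes, cores_per_node, ghz, flops_per_cycle):
    return nodes * cores_per_node * ghz * flops_per_cycle

def efficiency(measured_gflops, nodes, cores_per_node, ghz, flops_per_cycle=4):
    return measured_gflops / peak_gflops(nodes, cores_per_node, ghz, flops_per_cycle)

# 16 nodes x 8 cores x 2.5 GHz x 4 FLOPs/cycle = 1280 GFLOPS peak;
# a measured 960 GFLOPS would then be 75% efficiency.
eff = efficiency(960.0, nodes=16, cores_per_node=8, ghz=2.5)
print(f"{eff:.0%}")  # 75%
```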
transtec HPC expertise encompasses the Windows world as well. transtec is able to provide customers with Windows HPC systems that integrate seamlessly into their environment. Be it diskful or diskless deployment via WDS, integration into an existing AD environment, or setup and configuration of a WSUS server for centralized update provisioning, transtec gives customers any Windows solution at hand that is needed for High Productivity Computing.

To meet more advanced requirements, by means of the Moab Adaptive HPC Suite, transtec as a provider of HPC Professional Services will also set up dynamic deployment solutions for mixed Linux-Windows systems, either by dual-boot or by virtualization techniques.
MICROSOFT OFFICE EXCEL SUPPORT

Microsoft Office Excel is a critical business application across a broad range of industries. With its wealth of statistical analysis functions, support for constructing complex analyses, and virtually unlimited extensibility, Excel is clearly a tool of choice for analyzing business data. However, as calculations and modeling performed in Excel become more and more complex, Excel workbooks can take longer and longer to calculate, thereby reducing the business value provided.

Windows HPC Server 2008 R2 enables organizations to take advantage of HPC clusters to reduce calculation times for Excel workbooks by one or more orders of magnitude, scaling close to linearly as nodes or cores are added. Faster calculation times give business users and decision makers more information in less time, enabling more thorough analysis, faster access to important information, and better informed decisions. In addition, running Excel workbooks on an HPC cluster provides unique benefits in terms of reliability, resource utilization, and accounting and auditing support.
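How close to linear the scaling actually gets depends on how much of a workbook's run is parallelizable; Amdahl's law gives a quick estimate. A sketch with hypothetical numbers:

```python
# Sketch: Amdahl's-law estimate of workbook speedup on a cluster.
# The serial fraction (the part that cannot be parallelized) limits
# the achievable speedup. Figures below are hypothetical.

def speedup(n_nodes, serial_fraction):
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_nodes)

print(round(speedup(16, 0.05), 1))  # ~9.1x on 16 nodes with 5% serial work
print(round(speedup(16, 0.0), 1))   # 16.0x: fully parallel -> linear scaling
```

Iterative workbooks with fully independent input sets sit near the second case, which is why the near-linear scaling claim is realistic for them.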
SPEEDING UP EXCEL WORKBOOKS
Windows HPC Server 2008 R2 supports three different approaches to calculating Excel workbooks on an HPC cluster: using Excel as a cluster SOA client, running Excel user-defined functions (UDFs) on a cluster, and running Excel workbooks on a cluster. Using Excel as a cluster SOA client was possible with earlier versions of Windows HPC Server. Running Excel UDFs and Excel workbooks on a cluster are new capabilities, both of which require a combination of Windows HPC Server 2008 R2 and Office Excel 2010.

Using Excel as a Cluster SOA Client
Visual Studio Tools for Office provides a programming environment that is integrated with Excel and other Office products. Using Visual Studio Tools for Office, developers can write custom code to run Excel calculations on an HPC cluster utilizing SOA calls. Visual Studio Tools for Office supports the client libraries for Windows HPC Server 2008 R2, enabling the integration of Excel with any service or application that runs on the cluster.

Running Excel User-Defined Functions on an HPC Cluster
User-defined functions (UDFs) are a well-established mechanism for extending Excel, enabling functions that are contained in Excel extension libraries (XLLs) to be called from spreadsheet cells like any standard Excel function. Excel 2010 extends this model to the HPC cluster by enabling UDFs to be calculated on an HPC cluster by one or more compute nodes. If a long-running workbook includes multiple independent calls to user-defined functions, and these functions contribute to the overall processing time, then moving those calculations to the cluster can result in significant overall performance improvement. As far as users are concerned, there is no difference between a desktop function and a function running on the cluster – except for better performance.

Running Excel Workbooks on an HPC Cluster
Many complex, long-running workbooks run iteratively – that is, they perform a single calculation over and over, using different sets of input data. Such workbooks might include complex mathematical calculations contained in multiple worksheets, or they might contain complex VBA applications. When a workbook runs iteratively, the best option for parallelizing the calculation can be to run the entire workbook on the cluster.

Windows HPC Server 2008 R2 supports running Office Excel 2010 instances on the compute nodes of an HPC cluster, so that multiple long-running and iterative workbooks can be calculated in parallel to achieve better performance. Many workbooks that run on the desktop can run on the cluster – including workbooks that use Visual Basic for Applications, macros, and third-party add-ins. Support for running Excel workbooks on a cluster also includes features designed to run workbooks without user interaction, providing a robust platform for calculating Excel models without requiring constant oversight.

Although this approach can be used to calculate many workbooks on a cluster, some development is required. When workbooks run on the desktop, calculation results are inserted into spreadsheet cells. Because running Excel workbooks on a cluster uses Excel processes running on cluster nodes, the user or developer must define what data is to be calculated and how to retrieve the results. A macro framework is provided that can handle much of this work, and developers can customize the framework or write their own code to manage calculations and results, providing for virtually unlimited flexibility.
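The pattern described above – splitting an iterative calculation's input sets across nodes and gathering the results back in order – can be sketched as follows (illustrative only; the real mechanism uses Excel instances on the compute nodes and the provided macro framework):

```python
# Sketch: distribute an iterative calculation's input sets round-robin
# across compute nodes, run them, and merge results in input order.

def partition(inputs, n_nodes):
    """Round-robin split of the input sets, one slice per node."""
    return [inputs[i::n_nodes] for i in range(n_nodes)]

def run_on_cluster(inputs, n_nodes, calculate):
    parts = partition(inputs, n_nodes)
    results = {}
    for part in parts:          # in reality: one Excel instance per node
        for x in part:
            results[x] = calculate(x)
    return [results[x] for x in inputs]

out = run_on_cluster([1, 2, 3, 4, 5], n_nodes=2, calculate=lambda x: x + 10)
# -> [11, 12, 13, 14, 15]
```

This is exactly the "define what data is to be calculated and how to retrieve the results" step the macro framework takes care of.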
HPC computation results in the terabyte range are not uncommon. The problem in this context is not so much storing the data at rest, but the performance of the necessary copying back and forth in the course of the computation job flow, and the dependent job turn-around time. For interim results during a job runtime, or for fast storage of input and results data, parallel file systems have established themselves as the standard to meet the ever-increasing performance requirements of HPC storage systems. Parallel NFS is about to become the new standard framework for a parallel file system.
PARALLEL NFS
THE NEW STANDARD FOR HPC STORAGE

YESTERDAY'S SOLUTION: NFS FOR HPC STORAGE
The original Network File System (NFS), developed by Sun Microsystems at the end of the eighties and now available in version 4.1, has long been established as a de-facto standard for the provisioning of a global namespace in networked computing.

FIGURE 1 A CLASSICAL NFS SERVER IS A BOTTLENECK (NAS head = NFS server; cluster nodes = NFS clients)

A very widespread HPC cluster solution includes a central master node acting simultaneously as an NFS server, with its local file system storing input, interim, and results data and exporting them to all other cluster nodes. There is of course an immediate bottleneck in this method: when the load on the network is high, or where there are large numbers of nodes, the NFS server can no longer keep up delivering or receiving the data. In high-performance computing especially, the nodes are interconnected with at least Gigabit Ethernet, so the total aggregate throughput is well above what an NFS server with a Gigabit interface can achieve. Even a powerful network connection of the NFS server to the cluster, for
example with 10-Gigabit Ethernet, is only a temporary solution to this problem until the next cluster upgrade. The fundamental problem remains: this solution is not scalable. In addition, NFS is a difficult protocol to cluster in terms of load balancing: either you have to ensure that multiple NFS servers accessing the same data are constantly synchronised, the disadvantage being a noticeable drop in performance, or you manually partition the global namespace, which is also time-consuming. NFS is not suitable for dynamic load balancing, as on paper it appears to be stateless but in reality is, in fact, stateful.

TODAY'S SOLUTION: PARALLEL FILE SYSTEMS
For some time, powerful commercial products have been available to meet the high demands on an HPC storage system. The open-source solutions FraunhoferFS (FhGFS) from the Fraunhofer Competence Center for High Performance Computing or Lustre are widely used in the Linux HPC world, and several other free as well as commercial parallel file system solutions exist.

What is new is that the time-honoured NFS is to be upgraded, including a parallel version, into an Internet Standard, with the aim of interoperability between all operating systems. The original problem statement for parallel NFS access was written by Garth Gibson, a professor at Carnegie Mellon University and founder and CTO of Panasas. Gibson was already a renowned figure, being one of the authors contributing to the original paper on RAID architecture from 1988. The original statement from Gibson and Panasas is clearly noticeable in the design of pNFS. The powerful HPC file system developed by Gibson and Panasas, ActiveScale PanFS, with object-based storage devices functioning as central components, is basically the commercial continuation of the "Network-Attached Secure Disk (NASD)" project also developed by Garth Gibson at Carnegie Mellon University.

PARALLEL NFS
Parallel NFS (pNFS) is gradually emerging as the future standard to meet requirements in the HPC environment. From the industry's as well as the user's perspective, the benefits of utilising standard solutions are indisputable: besides protecting end-user investment, standards also ensure a defined level of interoperability without restricting the choice of products available. As a result, less user and administrator training is required, which leads to simpler deployment and, at the same time, greater acceptance.

As part of the NFS 4.1 Internet Standard, pNFS will not only adopt the semantics of NFS in terms of cache consistency or security, it also represents an easy and flexible extension of the NFS 4 protocol.
pNFS is optional; in other words, NFS 4.1 implementations do not have to include pNFS as a feature. The scheduled Internet Standard NFS 4.1 is today published as IETF RFC 5661.

The pNFS protocol supports a separation of metadata and data: a pNFS cluster comprises so-called storage devices, which store the data of the shared file system, and a metadata server (MDS) – called Director Blade with Panasas – the actual NFS 4.1 server. The metadata server keeps track of which data is stored on which storage devices and how to access the files, the so-called layout. Besides these "striping parameters", the MDS also
WHAT'S NEW IN NFS 4.1?

NFS 4.1 is a minor update to NFS 4 and adds new features to it. One of the optional features is parallel NFS (pNFS), but there is other new functionality as well.

One of the technical enhancements is the use of sessions, a persistent server object dynamically created by the client. By means of sessions, the state of an NFS connection can be stored, no matter whether the connection is live or not. Sessions survive temporary downtimes both of the client and the server.

Each session has a so-called fore channel, which is the connection from the client to the server for all RPC operations, and optionally a back channel for RPC callbacks from the server, which now can also be realized through firewall boundaries. Sessions can be trunked to increase the bandwidth. Besides session trunking, there is also client ID trunking for grouping together several sessions under the same client ID.

By means of sessions, NFS can be seen as a really stateful protocol with so-called "Exactly-Once Semantics (EOS)". Until now, a necessary but unspecified reply cache within the NFS server has been used to handle identical RPC operations that have been sent several times. This statefulness in reality is not very
  • robust, however, and sometimes leads to the well-known staleNFS handles. In NFS 4.1, the reply cache is now a mandatory partof the NFS implementation, storing the server replies to RPCrequests persistently on disk.Another new feature of NFS 4.1 is delegation for directories: NFSclients can be given temporary exclusive access to directories.Before, this has only been possible for simple files. With theforthcoming version 4.2 of the NFS standard, federated file-systems will be added as a feature, which represents the NFScounterpart of Microsoft’s DFS (distributed filesystem). 95
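The exactly-once semantics that sessions enable can be illustrated with a short sketch. All class and method names below are invented for illustration (this is not code from any NFS implementation): the server keeps a reply cache keyed by session and slot, and a retransmitted request carrying the same sequence number gets the stored reply back instead of re-executing the operation.

```python
# Sketch of NFS 4.1 exactly-once semantics (EOS) via a session reply cache.
# All names are illustrative, not taken from a real NFS stack.

class SessionReplyCache:
    def __init__(self):
        # (session_id, slot_id) -> (sequence_id, cached_reply)
        self.cache = {}

    def execute(self, session_id, slot_id, sequence_id, operation):
        key = (session_id, slot_id)
        if key in self.cache:
            last_seq, last_reply = self.cache[key]
            if sequence_id == last_seq:
                # Retransmission of an already-executed request:
                # return the cached reply, do NOT run the operation again.
                return last_reply
        # New request on this slot: execute it and remember the reply.
        reply = operation()
        self.cache[key] = (sequence_id, reply)
        return reply

calls = []
def create_file():
    calls.append("CREATE")          # a side effect that must not be repeated
    return "fh-0001"

srv = SessionReplyCache()
r1 = srv.execute("sess-A", 0, 7, create_file)   # first transmission
r2 = srv.execute("sess-A", 0, 7, create_file)   # retransmission after timeout
print(r1 == r2, len(calls))                     # True 1
```

In NFS 4.1 this cache is a mandatory part of the server, and the replies can be stored persistently so the semantics survive a server restart.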
manages other metadata, including access rights and similar information that is usually stored in a file's inode.

The layout types define which Storage Access Protocol is used by the clients to access the storage devices. Up until now, three potential storage access protocols have been defined for pNFS: file, block and object-based layouts, the former being directly described in RFC 5661, the latter two in RFC 5663 and 5664, respectively. Last but not least, a Control Protocol is also used by the MDS and storage devices to synchronise status data. This protocol is deliberately unspecified in the standard to give manufacturers certain flexibility. The NFS 4.1 standard does however specify certain conditions which a control protocol has to fulfil, for example, how to deal with the change/modify time attributes of files.

pNFS supports backwards compatibility with non-pNFS-compatible NFS 4 clients. In this case, the MDS itself gathers data from the storage devices on behalf of the NFS client and presents the data to the NFS client via NFS 4. The MDS acts as a kind of proxy server – which is e.g. what the Director Blades from Panasas do.

FIGURE 2: PARALLEL NFS – THE NEW STANDARD FOR HPC STORAGE
[Figure: pNFS clients access the storage devices directly via the Storage Access Protocol; the metadata server is reached via NFS 4.1, and the metadata server and storage devices synchronise via the Control Protocol.]

PNFS LAYOUT TYPES
If storage devices act simply as NFS 4 file servers, the file layout is used. It is the only storage access protocol directly specified in the NFS 4.1 standard. Besides the stripe sizes and stripe locations (storage devices), it also includes the NFS file handles which the client needs to use to access the separate file areas. The file layout is compact and static; the striping information does not change even if changes are made to the file, enabling
multiple pNFS clients to simultaneously cache the layout and avoid synchronisation overhead between clients and the MDS or the MDS and storage devices.

File system authorisation and client authentication can be well implemented with the file layout. When using NFS 4 as the storage access protocol, client authentication merely depends on the security flavor used – when using the RPCSEC_GSS security flavor, client access is kerberized, for example, and the server controls access authorization using specified ACLs and cryptographic processes.

In contrast, the block/volume layout uses volume identifiers and block offsets and extents to specify a file layout. SCSI block commands are used to access storage devices. As the block distribution can change with each write access, the layout must be updated more frequently than with the file layout.

Block-based access to storage devices does not offer any secure authentication option for the accessing SCSI initiator. Secure SAN authorisation is possible with host granularity only, based on World Wide Names (WWNs) with Fibre Channel or Initiator Node Names (IQNs) with iSCSI. The server cannot enforce access control governed by the file system. On the contrary, a pNFS client basically voluntarily abides by the access rights; the storage device has to trust the pNFS client – a fundamental access control problem that is a recurrent issue in the NFS protocol history.

The object layout is syntactically similar to the file layout, but it uses the SCSI object command set for data access to so-called Object-based Storage Devices (OSDs) and is heavily based on the DirectFLOW protocol of the ActiveScale PanFS from Panasas. From the very start, Object-Based Storage Devices were designed for secure authentication and access. So-called capabilities are used for object access, with the MDS issuing these capabilities to the pNFS clients. The ownership of such a capability represents the authoritative access right to an object.

pNFS can be upgraded to integrate other storage access protocols and operating systems, and storage manufacturers also have the option to ship additional layout drivers for their pNFS implementations.
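How a layout's striping parameters translate file offsets into accesses on individual storage devices can be pictured with a small sketch. The stripe unit size, the device names and the simple round-robin mapping below are illustrative assumptions, not values taken from the standard:

```python
# Sketch of file striping as described by a pNFS layout: a file offset is
# mapped round-robin onto storage devices.  Parameters are illustrative.

STRIPE_UNIT = 65536                          # bytes per stripe unit
DEVICES = ["osd0", "osd1", "osd2", "osd3"]   # storage devices in the layout

def locate(offset):
    """Map a file offset to (device, offset within that device's object)."""
    unit = offset // STRIPE_UNIT                  # overall stripe unit index
    stripe, dev_idx = divmod(unit, len(DEVICES))  # which round, which device
    obj_offset = stripe * STRIPE_UNIT + offset % STRIPE_UNIT
    return DEVICES[dev_idx], obj_offset

print(locate(0))          # ('osd0', 0)
print(locate(65536))      # ('osd1', 0)
print(locate(4 * 65536))  # ('osd0', 65536) - the stripe wraps around
```

A client holding such a layout can issue the resulting reads and writes to all devices in parallel, which is where the bandwidth advantage of pNFS comes from.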
PANASAS HPC STORAGE

With many years of experience in deploying parallel file systems like Lustre or FraunhoferFS (FhGFS), from smaller scales up to hundreds of terabytes of capacity and throughputs of several gigabytes per second, transtec chose Panasas, the leader in HPC storage solutions, as the partner for providing highest performance and scalability on the one hand, and ease of management on the other. Therefore, with Panasas as the technology leader, and transtec's overall experience and customer-oriented approach, customers can be assured of getting the best possible HPC storage solution available.
The Panasas file system uses parallel and redundant access to object storage devices (OSDs), per-file RAID, distributed metadata management, consistent client caching, file locking services, and internal cluster management to provide a scalable, fault tolerant, high performance distributed file system. The clustered design of the storage system and the use of client-driven RAID provide scalable performance to many concurrent file system clients through parallel access to file data that is striped across OSD storage nodes. RAID recovery is performed in parallel by the cluster of metadata managers, and declustered data placement yields scalable RAID rebuild rates as the storage system grows larger.

INTRODUCTION
Storage systems for high performance computing environments must be designed to scale in performance so that they can be configured to match the required load. Clustering techniques are often used to provide scalability. In a storage cluster, many nodes each control some storage, and the overall distributed file system assembles the cluster elements into one large, seamless storage system. The storage cluster can be hosted on the same computers that perform data processing, or it can be a separate cluster that is devoted entirely to storage and accessible to the compute cluster via a network protocol.

The Panasas storage system is a specialized storage cluster, and this paper presents its design and a number of performance measurements to illustrate the scalability. The Panasas system is a production system that provides file service to some of the largest compute clusters in the world – in scientific labs, in seismic data processing, in digital animation studios, in computational fluid dynamics, in semiconductor manufacturing, and in general purpose computing environments. In these environments, hundreds or thousands of file system clients share data and generate very high aggregate I/O load on the file system. The Panasas system is designed to support several thousand clients and storage capacities in excess of a petabyte.

The unique aspects of the Panasas system are its use of per-file, client-driven RAID, its parallel RAID rebuild, its treatment of different classes of metadata (block, file, system), and a commodity-parts-based blade hardware with integrated UPS. Of course, the system has many other features (such as object storage, fault tolerance, caching and cache consistency, and a simplified management model) that are not unique, but are necessary for a scalable system implementation.

PANASAS FILE SYSTEM BACKGROUND
The two overall themes of the system are object storage, which affects how the file system manages its data, and clustering of components, which allows the system to scale in performance and capacity.

Object Storage
An object is a container for data and attributes; it is analogous to the inode inside a traditional UNIX file system implementation. Specialized storage nodes called Object Storage Devices (OSD) store objects in a local OSDFS file system. The object interface addresses objects in a two-level (partition ID/object ID) namespace. The OSD wire protocol provides byte-oriented access to the data, attribute manipulation, creation and deletion of objects, and several other specialized operations. Panasas uses an iSCSI transport to carry OSD commands that are very similar to the OSDv2 standard currently in progress within SNIA and ANSI-T10.

The Panasas file system is layered over the object storage. Each file is striped over two or more objects to provide redundancy and high bandwidth access. The file system semantics are implemented by metadata managers that mediate access to objects from clients of
the file system. The clients access the object storage using the iSCSI/OSD protocol for Read and Write operations. The I/O operations proceed directly and in parallel to the storage nodes, bypassing the metadata managers. The clients interact with the out-of-band metadata managers via RPC to obtain access capabilities and location information for the objects that store files. Object attributes are used to store file-level attributes, and directories are implemented with objects that store name to object ID mappings. Thus the file system metadata is kept in the object store itself, rather than being kept in a separate database or some other form of storage on the metadata nodes.

PANASAS SYSTEM COMPONENTS
[Figure: a compute node running the Panasas client talks to a manager node (SysMgr, PanFS metadata manager, NFS/CIFS gateway) via RPC, and to the storage nodes (OSDFS) via iSCSI/OSD.]

System Software Components
The major software subsystems are the OSDFS object storage system, the Panasas file system metadata manager, the Panasas file system client, the NFS/CIFS gateway, and the overall cluster management system.

The Panasas client is an installable kernel module that runs inside the Linux kernel. The kernel module implements the standard VFS interface, so that the client hosts can mount the file system and use a POSIX interface to the storage system.

Each storage cluster node runs a common platform that is based on FreeBSD, with additional services to provide hardware monitoring, configuration management, and overall control.

The storage nodes use a specialized local file system (OSDFS) that implements the object storage primitives. They implement an iSCSI target and the OSD command set. The OSDFS object store and iSCSI target/OSD command processor are kernel modules. OSDFS is concerned with traditional block-level file system issues such as efficient disk arm utilization, media management (i.e., error handling), high throughput, as
well as the OSD interface.

The cluster manager (SysMgr) maintains the global configuration, and it controls the other services and nodes in the storage cluster. There is an associated management application that provides both a command line interface (CLI) and an HTML interface (GUI). These are all user-level applications that run on a subset of the manager nodes. The cluster manager is concerned with membership in the storage cluster, fault detection, configuration management, and overall control for operations like software upgrade and system restart.

The Panasas metadata manager (PanFS) implements the file system semantics and manages data striping across the object storage devices. This is a user-level application that runs on every manager node. The metadata manager is concerned with distributed file system issues such as secure multi-user access, maintaining consistent file- and object-level metadata, client cache coherency, and recovery from client, storage node, and metadata server crashes. Fault tolerance is based on a local transaction log that is replicated to a backup on a different manager node.

The NFS and CIFS services provide access to the file system for hosts that cannot use our Linux installable file system client. The NFS service is a tuned version of the standard FreeBSD NFS server that runs inside the kernel. The CIFS service is based on Samba and runs at user level. In turn, these services use a local instance of the file system client, which runs inside the FreeBSD kernel. These gateway services run on every manager node to provide a clustered NFS and CIFS service.

Commodity Hardware Platform
The storage cluster nodes are implemented as blades, very compact computer systems made from commodity parts. The blades are clustered together to provide a scalable platform. The OSD StorageBlade module and the metadata manager DirectorBlade module use the same form factor blade and fit into the same chassis slots.

Storage Management
Traditional storage management tasks involve partitioning available storage space into LUNs (i.e., logical units that are one or more disks, or a subset of a RAID array), assigning LUN ownership to different hosts, configuring RAID parameters, creating file systems or databases on LUNs, and connecting clients to the correct server for their storage. This can be a labor-intensive scenario. Panasas provides a simplified model for storage management that shields the storage administrator from these kinds of details and allows a single, part-time admin to manage systems that are hundreds of terabytes in size.

The Panasas storage system presents itself as a file system with a POSIX interface, and hides most of the complexities of storage management. Clients have a single mount point for the entire system. The /etc/fstab file references the cluster manager, and from that the client learns the location of the metadata service instances. The administrator can add storage while the system is online, and new resources are automatically discovered. To manage available storage, Panasas introduces two basic storage concepts: a physical storage pool called a BladeSet, and a logical quota tree called a Volume.

The BladeSet is a collection of StorageBlade modules in one or more shelves that comprise a RAID fault domain. Panasas mitigates the risk of large fault domains with the scalable rebuild performance
described below in the text. The BladeSet is a hard physical boundary for the volumes it contains. A BladeSet can be grown at any time, either by adding more StorageBlade modules, or by merging two existing BladeSets together.

The Volume is a directory hierarchy that has a quota constraint and is assigned to a particular BladeSet. The quota can be changed at any time, and capacity is not allocated to the Volume until it is used, so multiple volumes compete for space within their BladeSet and grow on demand. The files in those volumes are distributed among all the StorageBlade modules in the BladeSet. Volumes appear in the file system name space as directories. Clients have a single mount point for the whole storage system, and volumes are simply directories below the mount point. There is no need to update client mounts when the admin creates, deletes, or renames volumes.

Automatic Capacity Balancing
Capacity imbalance occurs when expanding a BladeSet (i.e., adding new, empty storage nodes), when merging two BladeSets, and when replacing a storage node following a failure. In the latter scenario, the imbalance is the result of the RAID rebuild, which uses spare capacity on every storage node rather than dedicating a specific "hot spare" node. This provides better throughput during rebuild, but causes the system to have a new, empty storage node after the failed storage node is replaced. The system automatically balances used capacity across storage nodes in a BladeSet using two mechanisms: passive balancing and active balancing. Passive balancing changes the probability that a storage node will be used for a new component of a file, based on its available capacity. This takes effect when files are created, and when their stripe size is increased to include more storage nodes. Active balancing

©2011 Panasas Incorporated. All rights reserved. Panasas, the Panasas logo, Accelerating Time to Results, ActiveScale, DirectFLOW, DirectorBlade, StorageBlade, PanFS, PanActive and MyPanasas are trademarks or registered trademarks of Panasas, Inc. in the United States and other countries. All other trademarks are the property of their respective owners. Information supplied by Panasas, Inc. is believed to be accurate and reliable at the time of publication, but Panasas, Inc. assumes no responsibility for any errors that may appear in this document. Panasas, Inc. reserves the right, without notice, to make changes in product design, specifications and prices. Information is subject to change without notice.
is done by moving an existing component object from one storage node to another, and updating the storage map for the affected file. During the transfer, the file is transparently marked read-only by the storage management layer, and the capacity balancer skips files that are being actively written. Capacity balancing is thus transparent to file system clients.

OBJECT RAID AND RECONSTRUCTION
Panasas protects against loss of a data object or an entire storage node by striping files across objects stored on different storage nodes, using a fault-tolerant striping algorithm such as RAID-1 or RAID-5. Small files are mirrored on two objects, and larger files are striped more widely to provide higher bandwidth and less capacity overhead from parity information. The per-file RAID layout means that parity information for different files is not mixed together, and easily allows different files to use different RAID schemes alongside each other. This property and the security mechanisms of the OSD protocol make it possible to enforce access control over files even as clients access storage nodes directly. It also enables what is perhaps the most novel aspect of our system, client-driven RAID: the clients are responsible for computing and writing parity. The OSD security mechanism also allows multiple metadata managers to manage objects on the same storage device without heavyweight coordination or interference from each other.

Client-driven, per-file RAID has four advantages for large-scale storage systems. First, by having clients compute parity for their own data, the XOR power of the system scales up as the number of clients increases. We measured XOR processing during streaming write bandwidth loads at 7% of the client's CPU, with the rest going to the OSD/iSCSI/TCP/IP stack and other file system overhead. Moving XOR computation out of the storage system into the client requires some additional work to handle failures. Clients are responsible for generating good data and good parity for it. Because the RAID equation is per-file, an errant client can only damage its own data. However, if a client fails during a write, the metadata manager will scrub parity to ensure the parity equation is correct.

The second advantage of client-driven RAID is that clients can perform an end-to-end data integrity check. Data has to go through the disk subsystem, through the network interface on the storage nodes, through the network and routers, through the NIC on the client, and all of these transits can introduce errors with a very low probability. Clients can choose to read parity as well as data, and verify parity as part of a read operation. If errors are detected, the operation is retried. If the error is persistent, an alert is raised and the read operation fails. By checking parity across storage nodes within the client, the system can ensure end-to-end data integrity. This is another novel property of per-file, client-driven RAID.

Third, per-file RAID protection lets the metadata managers rebuild files in parallel. Although parallel rebuild is theoretically possible in block-based RAID, it is rarely implemented. This is due to the fact that the disks are owned by a single RAID controller, even in dual-ported configurations. Large storage systems have multiple RAID controllers that are not interconnected. Since the SCSI block command set does not provide fine-grained synchronization operations, it is difficult for multiple RAID controllers to coordinate a complicated operation such as an online rebuild without external communication. Even if they could, without connectivity to the disks in the affected parity group, other RAID controllers would be unable to assist. Even in a high-availability configuration, each disk is typically only attached to two different RAID controllers, which limits the potential speedup to 2x.

When a StorageBlade module fails, the metadata managers that
own Volumes within that BladeSet determine what files are affected, and then they farm out file reconstruction work to every other metadata manager in the system. Metadata managers rebuild their own files first, but if they finish early or do not own any Volumes in the affected BladeSet, they are free to aid other metadata managers. Declustered parity groups spread out the I/O workload among all StorageBlade modules in the BladeSet. The result is that larger storage clusters reconstruct lost data more quickly.

DECLUSTERED PARITY GROUPS
[Figure: parity group elements, indicated by letters, distributed across eight storage devices; the capital letters mark the parity groups that all share the second storage device.]

The fourth advantage of per-file RAID is that unrecoverable faults can be constrained to individual files. The most commonly encountered double-failure scenario with RAID-5 is an unrecoverable read error (i.e., a grown media defect) during the reconstruction of a failed storage device. The second storage device is still healthy, but it has been unable to read a sector, which prevents rebuild of the sector lost from the first drive and potentially the entire stripe or LUN, depending on the design of the RAID controller. With block-based RAID, it is difficult or impossible to directly map any lost sectors back to higher-level file system data structures, so a full file system check and media scan will be required to locate and repair the damage. A more typical response is to fail the rebuild entirely. RAID controllers monitor drives in an effort to scrub out media defects and avoid this bad scenario, and the Panasas system does media scrubbing, too. However, with high-capacity SATA drives, the chance of encountering a media defect on drive B while rebuilding drive A is still significant. With per-file RAID-5, this sort of double failure means that only a single file is lost, and the specific file can be easily identified and reported to the administrator. While block-based RAID systems have been compelled to introduce RAID-6 (i.e., fault-tolerant schemes that handle two failures), the Panasas solution is able to deploy highly reliable RAID-5 systems with large, high performance storage pools.
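The parity arithmetic behind client-driven, per-file RAID-5 can be sketched in a few lines. This is a toy model with made-up data, not Panasas code; it only shows why XOR parity computed by the client suffices to rebuild any single lost stripe unit:

```python
# Toy model of per-file RAID-5: the client XORs its stripe units into a
# parity unit before writing; one lost unit is rebuilt from the survivors.

def xor_blocks(blocks):
    """XOR a list of equal-sized byte strings together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

data_units = [b"AAAA", b"BBBB", b"CCCC"]   # one stripe of one file
parity = xor_blocks(data_units)            # computed on the client

# The node holding unit 1 fails; rebuild it from the rest plus parity.
rebuilt = xor_blocks([data_units[0], data_units[2], parity])
print(rebuilt == data_units[1])            # True
```

Because the parity equation covers only one file, a client writing bad parity can damage no data but its own, which is the containment property described above.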
RAID Rebuild Performance
RAID rebuild performance determines how quickly the system can recover data when a storage node is lost. Short rebuild times reduce the window in which a second failure can cause data loss. There are three techniques to reduce rebuild times: reducing the size of the RAID parity group, declustering the placement of parity group elements, and rebuilding files in parallel using multiple RAID engines.

The rebuild bandwidth is the rate at which reconstructed data is written to the system when a storage node is being reconstructed. The system must read N times as much as it writes, depending on the width of the RAID parity group, so the overall throughput of the storage system is several times higher than the rebuild rate. A narrower RAID parity group requires fewer read and XOR operations to rebuild, and so will result in a higher rebuild bandwidth. However, it also results in higher capacity overhead for parity data, and can limit bandwidth during normal I/O. Thus, selection of the RAID parity group size is a trade-off between capacity overhead, on-line performance, and rebuild performance.

Understanding declustering is easier with a picture. In the figure on the left, each parity group has 4 elements, which are indicated by letters placed in each storage device. They are distributed among 8 storage devices. The ratio between the parity group size and the available storage devices is the declustering ratio, which in this example is ½. In the picture, capital letters represent those parity groups that all share the second storage node. If the second storage device were to fail, the system would have to read the surviving members of its parity groups to rebuild the lost elements. You can see that the other elements of those parity groups occupy about ½ of each other storage device.

For this simple example you can assume each parity element is the same size, so all the devices are filled equally. In a real system, the component objects will have various sizes depending on the overall file size, although each member of a parity group will be very close in size. There will be thousands or millions of objects on each device, and the Panasas system uses active balancing to move component objects between storage nodes to level capacity.

Declustering means that rebuild requires reading a subset of each device, with the proportion being approximately the same as the declustering ratio. The total amount of data read is the same with and without declustering, but with declustering it is spread out over more devices. When writing the reconstructed elements, two elements of the same parity group cannot be located on the same storage node. Declustering leaves many storage devices available for the reconstructed parity element, and randomizing the placement of each file's parity group lets the system spread out the write I/O over all the storage. Thus declustering RAID parity groups has the important property of taking a fixed amount of rebuild I/O and spreading it out over more storage devices.

Having per-file RAID allows the Panasas system to divide the work among the available DirectorBlade modules by assigning different files to different DirectorBlade modules. This division is dynamic, with a simple master/worker model in which metadata services make themselves available as workers, and each metadata service acts as the master for the volumes it implements. By doing rebuilds in parallel on all DirectorBlade modules, the system can apply more XOR throughput and utilize the additional I/O bandwidth obtained with declustering.
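The declustering arithmetic can be made concrete with the numbers from the example above (parity groups of width 4 over 8 devices). The formulas are standard declustered-RAID bookkeeping, sketched here for illustration:

```python
# Declustered rebuild arithmetic for the example in the text:
# parity groups of width 4 spread over a BladeSet of 8 storage devices.

group_width = 4    # elements per parity group
devices = 8        # storage devices in the BladeSet

# Fraction of each surviving device occupied by groups that touch the
# failed device: the declustering ratio (1/2 in the example).
declustering_ratio = group_width / devices

# Each affected group needs its (width - 1) surviving elements read, and
# that read load is spread over the (devices - 1) surviving devices.
per_device_read = (group_width - 1) / (devices - 1)

print(declustering_ratio)          # 0.5
print(round(per_device_read, 3))   # 0.429, roughly the declustering ratio
```

Doubling the number of devices at a fixed group width halves both numbers, which is why rebuild rates scale as the storage cluster grows.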
METADATA MANAGEMENT
There are several kinds of metadata in the Panasas system. These include the mapping from object IDs to sets of block addresses, the mapping from files to sets of objects, file system attributes such as ACLs and owners, file system namespace information (i.e., directories), and configuration/management information about the storage cluster itself.

CREATING A FILE
[Figure: steps in creating a file – 1: Create RPC from the client to the metadata server; 2: op-log entry; 3: Create operations to the OSDs; 4: cap-log entry; 5: reply cache; 6: reply to the client; 7: deferred directory writes; 8: op-log entry removed. The metadata server keeps its op-log, cap-log, reply cache and transaction log locally.]

Block-level Metadata
Block-level metadata is managed internally by OSDFS, the file system that is optimized to store objects. OSDFS uses a floating block allocation scheme where data, block pointers, and object descriptors are batched into large write operations. The write buffer is protected by the integrated UPS, and it is flushed to disk on power failure or system panics. Fragmentation was an issue in early versions of OSDFS that used a first-fit block allocator, but this has been significantly mitigated in later versions that use a modified best-fit allocator.

OSDFS stores higher-level file system data structures, such as the partition and object tables, in a modified BTree data structure. Block mapping for each object uses a traditional direct/indirect/double-indirect scheme. Free blocks are tracked by a proprietary bitmap-like data structure that is optimized for copy-on-write reference counting, part of OSDFS's integrated support for object- and partition-level copy-on-write snapshots. Block-level metadata management consumes most of the cycles in file system implementations. By delegating storage management to OSDFS, the Panasas metadata managers have an order of magnitude less work to do than the equivalent SAN file system metadata manager that must track all the blocks in the system.
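The traditional direct/indirect/double-indirect block mapping mentioned above can be sketched as follows. The pointer counts are illustrative, not OSDFS's actual parameters; the sketch only shows how a logical block index selects a pointer level:

```python
# Sketch of a direct/indirect/double-indirect block map.  The pointer
# counts are illustrative, not the actual OSDFS parameters.

NDIRECT = 12           # direct block pointers in the object descriptor
PTRS_PER_BLOCK = 1024  # block pointers stored in one indirect block

def classify(block_index):
    """Return which pointer level resolves a given logical block index."""
    if block_index < NDIRECT:
        return "direct"
    block_index -= NDIRECT
    if block_index < PTRS_PER_BLOCK:
        return "indirect"
    block_index -= PTRS_PER_BLOCK
    if block_index < PTRS_PER_BLOCK ** 2:
        return "double-indirect"
    raise ValueError("block index beyond maximum object size")

print(classify(0))      # direct
print(classify(500))    # indirect
print(classify(20000))  # double-indirect
```

Keeping this per-block bookkeeping inside OSDFS is exactly what relieves the Panasas metadata managers of an order of magnitude of work.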
File-level Metadata
Above the block layer is the metadata about files. This includes user-visible information such as the owner, size, and modification time, as well as internal information that identifies which objects store the file and how the data is striped across those objects (i.e., the file's storage map). Our system stores this file metadata in object attributes on two of the N objects used to store the file's data. The rest of the objects have basic attributes like their individual length and modify times, but the higher-level file system attributes are only stored on the two attribute-storing components.

File names are implemented in directories similar to traditional UNIX file systems. Directories are special files that store an array of directory entries. A directory entry identifies a file with a tuple of <serviceID, partitionID, objectID>, and also includes two <osdID> fields that are hints about the location of the attribute-storing components. The partitionID/objectID is the two-level object numbering scheme of the OSD interface, and Panasas uses a partition for each volume. Directories are mirrored (RAID-1) in two objects so that the small write operations associated with directory updates are efficient.

Clients are allowed to read, cache and parse directories, or they can use a Lookup RPC to the metadata manager to translate a name to a <serviceID, partitionID, objectID> tuple and the <osdID> location hints. The serviceID provides a hint about the metadata manager for the file, although clients may be redirected to the metadata manager that currently controls the file. The osdID hint can become out-of-date if reconstruction or active balancing moves an object. If both osdID hints fail, the metadata manager has to multicast a GetAttributes to the storage nodes in the BladeSet to locate an object. The partitionID and objectID are the same on every storage node that stores a component of the file, so this technique will always work. Once the file is located, the metadata manager automatically updates the stored hints in the directory, allowing future accesses to bypass this step.

File operations may require several object operations. The figure on the left shows the steps used in creating a file. The metadata manager keeps a local journal to record in-progress actions so it can recover from object failures and metadata manager crashes that occur when updating multiple objects. For example, creating a file is a fairly complex task that requires updating the parent directory as well as creating the new file. There are 2 Create OSD operations to create the first two components of the file, and 2 Write OSD operations, one to each replica of the parent directory. As a performance optimization, the metadata server also grants the client read and write access to the file and returns the appropriate capabilities to the client as part of the FileCreate results. The server makes a record of these write capabilities to support error recovery if the client crashes while writing the file. Note that the directory update (step 7) occurs after the reply, so that many directory updates can be batched together. The deferred update is protected by the op-log record that gets deleted in step 8 after the successful directory update.

The metadata manager maintains an op-log that records the object create and the directory updates that are in progress. This log entry is removed when the operation is complete. If the metadata service crashes and restarts, or a failure event moves the metadata service to a different manager node, then the op-log is processed to determine what operations were active at the time of the failure. The metadata manager rolls the operations forward or backward to ensure the object store
PARALLEL NFS
PANASAS HPC STORAGE
If no reply to the operation has been generated, then the operation is rolled back. If a reply has been generated but pending operations are outstanding (e.g., directory updates), then the operation is rolled forward.
The write capability is stored in a cap-log so that when a metadata server starts it knows which of its files are busy. In addition to the "piggybacked" write capability returned by FileCreate, the client can also execute a StartWrite RPC to obtain a separate write capability. The cap-log entry is removed when the client releases the write cap via an EndWrite RPC. If the client reports an error during its I/O, then a repair log entry is made and the file is scheduled for repair. Read and write capabilities are cached by the client over multiple system calls, further reducing metadata server traffic.
System-level Metadata
The final layer of metadata is information about the overall system itself. One possibility would be to store this information in objects and bootstrap the system through a discovery protocol. The most difficult aspect of that approach is reasoning about the fault model. The system must be able to come up and be manageable while it is only partially functional. Panasas chose instead a model with a small replicated set of system managers, each of which stores a replica of the system configuration metadata. Each system manager maintains a local database, outside of the object storage system. Berkeley DB is used to store tables that represent our system model. The different system manager instances are members of a replication set that use Lamport's part-time parliament (PTP) protocol to make decisions and update the configuration information. Clusters are configured with one, three, or five system managers so that the voting quorum has an odd number and a network partition will cause a minority of system managers to disable themselves.
"Outstanding in the HPC world, the ActiveStor solutions provided by Panasas are undoubtedly the only HPC storage solutions that combine highest scalability and performance with a convincing ease of management."
Thomas Gebert, HPC Solution Architect
System configuration state includes both static state, such as the identity of the blades in the system, as well as dynamic state such as the online/offline state of various services and error conditions associated with different system components. Each state update decision, whether it is updating the admin password or activating a service, involves a voting round and an update round according to the PTP protocol. Database updates are performed within the PTP transactions to keep the databases synchronized. Finally, the system keeps backup copies of the system configuration databases on several other blades to guard against catastrophic loss of every system manager blade.
Blade configuration is pulled from the system managers as part of each blade's startup sequence. The initial DHCP handshake conveys the addresses of the system managers, and thereafter the local OS on each blade pulls configuration information from the system managers via RPC.
The cluster manager implementation has two layers. The lower-level PTP layer manages the voting rounds and ensures that partitioned or newly added system managers will be brought up to date with the quorum. The application layer above that uses the voting and update interface to make decisions. Complex system operations may involve several steps, and the system manager has to keep track of its progress so it can tolerate a crash and roll back or roll forward as appropriate.
For example, creating a volume (i.e., a quota-tree) involves file system operations to create a top-level directory, object operations to create an object partition within OSDFS on each StorageBlade module, service operations to activate the appropriate metadata manager, and configuration database operations to reflect the addition of the volume. Recovery is enabled by having two PTP transactions. The initial PTP transaction determines if the volume should be created, and it creates a record about the volume that is marked as incomplete. Then the system manager does all the necessary service activations, file and storage operations. When these all complete, a final PTP transaction is performed to commit the operation. If the system manager crashes before the final PTP transaction, it will detect the incomplete operation the next time it restarts, and then roll the operation forward or backward.
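The reason for the odd-sized configurations of one, three, or five system managers can be sketched with simple majority-voting arithmetic. This is an illustrative sketch, not the PTP implementation; the function name is invented.

```c
#include <assert.h>
#include <stdbool.h>

/* A voting quorum requires a strict majority of the configured system
 * managers. After a network partition at most one side can hold a
 * majority, so the minority side disables itself. With an even count,
 * an equal split would leave NO side with a quorum, which is why the
 * supported configurations (1, 3, 5) are all odd. */
bool has_quorum(int total_managers, int reachable)
{
    return reachable > total_managers / 2;
}
```

For example, a five-manager cluster partitioned 3/2 keeps the three-manager side online, while the two-manager side disables itself.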
Graphics chips started as fixed-function graphics pipelines. Over the years, these graphics chips became increasingly programmable, which led NVIDIA to introduce the first GPU, or Graphics Processing Unit. In the 1999-2000 timeframe, computer scientists in particular, along with researchers in fields such as medical imaging and electromagnetics, started using GPUs for running general-purpose computational applications. They found that the excellent floating-point performance of GPUs led to a huge performance boost for a range of scientific applications. This was the advent of the movement called GPGPU, or General-Purpose computing on GPUs. The problem was that GPGPU required using graphics programming languages like OpenGL and Cg to program the GPU. Developers had to make their scientific applications look like graphics applications and map them into problems that drew triangles and polygons. This limited the accessibility of the tremendous performance of GPUs for science. NVIDIA realized the potential to bring this performance to the larger scientific community and decided to invest in modifying the GPU to make it fully programmable for scientific applications, adding support for high-level languages like C and C++. This led to the CUDA architecture for the GPU.
NVIDIA GPU COMPUTING
THE CUDA ARCHITECTURE
WHAT IS GPU COMPUTING?
GPU computing is the use of a GPU (graphics processing unit) to do general-purpose scientific and engineering computing. The model for GPU computing is to use a CPU and GPU together in a heterogeneous computing model. The sequential part of the application runs on the CPU and the computationally intensive part runs on the GPU. From the user's perspective, the application just runs faster because it is using the high performance of the GPU to boost performance.
The application developer has to modify the application to take the compute-intensive kernels and map them to the GPU. The rest of the application remains on the CPU. Mapping a function to the GPU involves rewriting the function to expose the parallelism in the function and adding "C" keywords to move data to and from the GPU.
GPU computing is enabled by the massively parallel architecture of NVIDIA's GPUs, called the CUDA architecture. The CUDA architecture consists of hundreds of processor cores that operate together to crunch through the data set in the application.
CUDA PARALLEL ARCHITECTURE AND PROGRAMMING MODEL
The CUDA parallel hardware architecture is accompanied by the CUDA parallel programming model, which provides a set of abstractions that enable expressing fine-grained and coarse-grained data and task parallelism. The programmer can choose to express the parallelism in high-level languages such as C, C++, and Fortran, or in driver APIs such as OpenCL and DirectX-11 Compute.
"We are very proud to be one of the leading providers of Tesla systems who are able to combine the overwhelming power of NVIDIA Tesla systems with the fully engineered and thoroughly tested transtec hardware to a total Tesla-based solution."
Norbert Zeidler, Senior HPC Solution Engineer
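The offload model described in this section, where a compute-intensive kernel assigns one data element to each GPU thread, can be illustrated with a plain-C sketch that serially emulates CUDA's block/thread indexing. The function and variable names here are invented for the example; in real CUDA C the per-element function would be a `__global__` kernel launched with the `<<<grid, block>>>` syntax rather than two host-side loops.

```c
#include <assert.h>

/* The body each GPU thread would run: compute one element of
 * y = a*x + y. The bounds guard is needed because the thread grid
 * is rounded up and may overshoot n. */
static void saxpy_element(int i, int n, float a,
                          const float *x, float *y)
{
    if (i < n)
        y[i] = a * x[i] + y[i];
}

/* Serial stand-in for a kernel launch: the two loops emulate the
 * grid of blocks and the threads within each block. On a GPU all
 * these iterations would execute concurrently. */
static void launch_saxpy(int n, float a, const float *x, float *y)
{
    int block_dim = 256;                             /* threads per block */
    int grid_dim  = (n + block_dim - 1) / block_dim; /* number of blocks  */
    for (int block = 0; block < grid_dim; block++)
        for (int thread = 0; thread < block_dim; thread++)
            saxpy_element(block * block_dim + thread, n, a, x, y);
}
```

The global index computation `block * block_dim + thread` is exactly the idiom a CUDA kernel expresses as `blockIdx.x * blockDim.x + threadIdx.x`.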
FIGURE 1: THE CUDA PARALLEL ARCHITECTURE (GPU computing applications in C, C++, OpenCL, DirectX Compute, Fortran, Java, and Python, running on the NVIDIA GPU with the CUDA Parallel Computing Architecture)
The CUDA parallel programming model guides programmers to partition the problem into coarse sub-problems that can be solved independently in parallel. Fine-grain parallelism in the sub-problems is then expressed such that each sub-problem can be solved cooperatively in parallel. The CUDA GPU architecture and the corresponding CUDA parallel computing model are now widely deployed with hundreds of applications and nearly a thousand published research papers.
GPU COMPUTING WITH CUDA
NVIDIA CUDA technology leverages the massively parallel processing power of NVIDIA GPUs. The CUDA architecture is a revolutionary parallel computing architecture that delivers the performance of NVIDIA's world-renowned graphics processor technology to general-purpose GPU computing. Applications that run on the CUDA architecture can take advantage of an installed base of over one hundred million CUDA-enabled GPUs in desktop and notebook computers, professional workstations, and supercomputer clusters.
With the CUDA architecture and tools, developers are achieving dramatic speedups in fields such as medical imaging and natural resource exploration, and creating breakthrough applications in areas such as image recognition and real-time HD video playback and encoding. CUDA enables this unprecedented performance via standard APIs such as the soon-to-be-released OpenCL and DirectX Compute, and high-level programming languages such as C/C++, Fortran, Java, Python, and the Microsoft .NET Framework.
CUDA: THE DEVELOPER'S VIEW
The CUDA package includes three important components: the CUDA Driver API (also known as the "Low-Level API"), the CUDA toolkit (the actual development environment, including runtime libraries), and a Software Development Kit (CUDA SDK) with code examples.
The CUDA toolkit is in principle a C development environment and includes the actual compiler (nvcc), an update of the PathScale C compiler, optimized FFT and BLAS libraries, a visual profiler (cudaprof), a gdb-based debugger (cudagdb), shared libraries for the runtime environment for CUDA programs (the "Runtime API"), and, last but not least, comprehensive documentation including a developer's manual.
The CUDA Developer SDK includes examples with source code for matrix calculation, pseudo-random number generators, image convolution, wavelet calculations, and a lot more besides.
THE CUDA ARCHITECTURE
The CUDA Architecture consists of several components:
 Parallel compute engines inside NVIDIA GPUs
 OS kernel-level support for hardware initialization, configuration, etc.
 User-mode driver, which provides a device-level API for developers
 PTX instruction set architecture (ISA) for parallel computing kernels and functions
FIGURE 2: THE CUDA PROGRAMMING MODEL (device-level APIs, i.e. applications using DirectX, OpenCL, or the CUDA Driver API, and language integration, i.e. applications using C, C++, Fortran, Java, or Python, layered over the CUDA Driver, PTX (ISA), CUDA support in the OS kernel, and the CUDA parallel compute engines inside NVIDIA GPUs)
The CUDA Software Development Environment supports two different programming interfaces:
 A device-level programming interface, in which the application uses DirectX Compute, OpenCL or the CUDA Driver API directly to configure the GPU, launch compute kernels, and read back results
 A language integration programming interface, in which an application uses the C Runtime for CUDA and developers use a small set of extensions to indicate which compute functions should be performed on the GPU instead of the CPU
When using the device-level programming interface, developers write compute kernels in separate files using the kernel language supported by their API of choice. DirectX Compute kernels (aka "compute shaders") are written in HLSL. OpenCL kernels are written in a C-like language called "OpenCL C". The CUDA Driver API accepts kernels written in C or PTX assembler.
When using the language integration programming interface, developers write compute functions in C, and the C Runtime for CUDA automatically handles setting up the GPU and executing the compute functions. This programming interface enables developers to take advantage of native support for high-level languages such as C, C++, Fortran, Java, Python, and more, reducing code complexity and development costs through type integration and code integration:
 Type integration allows standard types as well as vector types and user-defined types (including structs) to be used seamlessly across functions that are executed on the CPU and functions that are executed on the GPU
 Code integration allows the same function to be called from functions that will be executed on the CPU and functions that will be executed on the GPU
When it is necessary to distinguish functions that will be executed on the CPU from those that will be executed on the GPU, the term C for CUDA is used to describe the small set of extensions that allow developers to specify which functions will be executed on the GPU, how GPU memory will be used, and how the parallel processing capabilities of the GPU will be used by the application.
THE G80 ARCHITECTURE
NVIDIA's GeForce 8800 was the product that gave birth to the new GPU Computing model. Introduced in November 2006, the G80-based GeForce 8800 brought several key innovations to GPU Computing:
 G80 was the first GPU to support C, allowing programmers to use the power of the GPU without having to learn a new programming language
 G80 was the first GPU to replace the separate vertex and pixel pipelines with a single, unified processor that executed vertex, geometry, pixel, and computing programs
 G80 was the first GPU to utilize a scalar thread processor, eliminating the need for programmers to manually manage vector registers
 G80 introduced the single-instruction multiple-thread (SIMT) execution model, where multiple independent threads execute concurrently using a single instruction
 G80 introduced shared memory and barrier synchronization for inter-thread communication
In June 2008, NVIDIA introduced a major revision to the G80 architecture. The second-generation unified architecture – GT200 (first introduced in the GeForce GTX 280, Quadro FX 5800, and Tesla T10 GPUs) – increased the number of streaming processor cores (subsequently referred to as CUDA cores) from 128 to 240. Each processor register file was doubled in size, allowing a greater number of threads to execute on-chip at any given time.
Hardware memory access coalescing was added to improve memory access efficiency. Double precision floating point support was also added to address the needs of scientific and high-performance computing (HPC) applications.
When designing each new generation GPU, it has always been the philosophy at NVIDIA to improve both existing application performance and GPU programmability; while faster application performance brings immediate benefits, it is the GPU's relentless advancement in programmability that has allowed it to evolve into the most versatile parallel processor of our time.
CODENAME "FERMI"
The Fermi architecture is the most significant leap forward in GPU architecture since the original G80. G80 was the initial vision of what a unified graphics and computing parallel processor should look like. GT200 extended the performance and functionality of G80. With Fermi, NVIDIA has taken everything learned from the two prior processors and all the applications that were written for them, and employed a completely new approach to design to create the world's first computational GPU. When they started laying the groundwork for Fermi, they gathered extensive user feedback on GPU computing since the introduction of G80 and GT200, and focused on the following key areas for improvement:
 Improved double precision performance: while single precision floating point performance was on the order of ten times the performance of desktop CPUs, some GPU computing applications desired more double precision performance as well
FIGURE 3: IMPROVED MEMORY SUBSYSTEM (DRAM partitions, host interface, unified L2 cache, and GigaThread scheduler)
 ECC support: ECC allows GPU computing users to safely deploy large numbers of GPUs in datacenter installations, and also ensures data-sensitive applications like medical imaging and financial options pricing are protected from memory errors
 True cache hierarchy: some parallel algorithms were unable to use the GPU's shared memory, and users requested a true cache architecture to aid them
 More shared memory: many CUDA programmers requested more than 16 KB of SM shared memory to speed up their applications
 Faster context switching: users requested faster context switches between application programs and faster graphics and compute interoperation
 Faster atomic operations: users requested faster read-modify-write atomic operations for their parallel algorithms
With these requests in mind, the Fermi team designed a processor that greatly increases raw compute horsepower and, through architectural innovations, also offers dramatically increased programmability and compute efficiency. The key architectural highlights of Fermi are:
 Third Generation Streaming Multiprocessor (SM):
- 32 CUDA cores per SM, 4x over GT200
- 8x the peak double precision floating point performance over GT200
- Dual Warp Scheduler simultaneously schedules and dispatches instructions from two independent warps
- 64 KB of RAM with a configurable partitioning of shared memory and L1 cache
 Second Generation Parallel Thread Execution ISA:
- Unified Address Space with Full C++ Support
- Optimized for OpenCL and DirectCompute
- Full IEEE 754-2008 32-bit and 64-bit precision
- Full 32-bit integer path with 64-bit extensions
- Memory access instructions to support transition to 64-bit addressing
- Improved performance through predication
 Improved Memory Subsystem:
- NVIDIA Parallel DataCache hierarchy with configurable L1 and unified L2 caches
- First GPU with ECC memory support
- Greatly improved atomic memory operation performance
 NVIDIA GigaThread Engine:
- 10x faster application context switching
- Concurrent kernel execution
- Out-of-order thread block execution
- Dual overlapped memory transfer engines
AN OVERVIEW OF THE FERMI ARCHITECTURE
The first Fermi-based GPU, implemented with 3.0 billion transistors, features up to 512 CUDA cores. A CUDA core executes a floating point or integer instruction per clock for a thread. The 512 CUDA cores are organized in 16 SMs of 32 cores each. The GPU has six 64-bit memory partitions, for a 384-bit memory interface, supporting up to a total of 6 GB of GDDR5 DRAM memory. A host interface connects the GPU to the CPU via PCI-Express. The GigaThread global scheduler distributes thread blocks to SM thread schedulers.
THIRD GENERATION STREAMING MULTIPROCESSOR
The third generation SM introduces several architectural innovations that make it not only the most powerful SM yet built, but also the most programmable and efficient.
FIGURE 4: THIRD GENERATION STREAMING MULTIPROCESSOR (instruction cache, dual warp schedulers and dispatch units, 4096 x 32-bit register file, 32 CUDA cores, 16 load/store units, four SFUs, interconnect network, 64 KB shared memory / L1 cache, and uniform cache)
512 High Performance CUDA cores
Each SM features 32 CUDA processors – a fourfold increase over prior SM designs. Each CUDA processor has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). Prior GPUs used IEEE 754-1985 floating point arithmetic. The Fermi architecture implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by doing the multiplication and addition with a single final rounding step, with no loss of precision in the addition. FMA is more accurate than performing the operations separately. GT200 implemented double precision FMA.
In GT200, the integer ALU was limited to 24-bit precision for multiply operations; as a result, multi-instruction emulation sequences were required for integer arithmetic. In Fermi, the newly designed integer ALU supports full 32-bit precision for all instructions, consistent with standard programming language requirements. The integer ALU is also optimized to efficiently support 64-bit and extended precision operations. Various instructions are supported, including Boolean, shift, move, compare, convert, bit-field extract, bit-reverse insert, and population count.
16 Load/Store Units
Each SM has 16 load/store units, allowing source and destination addresses to be calculated for sixteen threads per clock. Supporting units load and store the data at each address to cache or DRAM.
Four Special Function Units
Special Function Units (SFUs) execute transcendental instructions such as sine, cosine, reciprocal, and square root. Each SFU executes one instruction per thread, per clock; a warp executes over eight clocks. The SFU pipeline is decoupled from the dispatch unit, allowing the dispatch unit to issue to other execution units while the SFU is occupied.
Designed for Double Precision
Double precision arithmetic is at the heart of HPC applications such as linear algebra, numerical simulation, and quantum chemistry. The Fermi architecture has been specifically designed to offer unprecedented performance in double precision; up to 16 double precision fused multiply-add operations can be performed per SM, per clock, a dramatic improvement over the GT200 architecture.
FIGURE 5: DOUBLE PRECISION APPLICATION PERFORMANCE (relative performance of the Fermi architecture versus GT200 on double precision matrix multiply and a double precision tri-diagonal solver)
DUAL WARP SCHEDULER
The SM schedules threads in groups of 32 parallel threads called warps. Each SM features two warp schedulers and two instruction dispatch units, allowing two warps to be issued and executed concurrently. Fermi's dual warp scheduler selects two warps, and issues one instruction from each warp to a group of sixteen cores, sixteen load/store units, or four SFUs. Because warps execute independently, Fermi's scheduler does not need to check for dependencies from within the instruction stream. Using this elegant model of dual-issue, Fermi achieves near-peak hardware performance.
Most instructions can be dual issued; two integer instructions, two floating point instructions, or a mix of integer, floating point, load, store, and SFU instructions can be issued concurrently. Double precision instructions do not support dual dispatch with any other operation.
FIGURE 6: DUAL WARP SCHEDULER (two warp schedulers, each with an instruction dispatch unit, issuing instructions from independent warps over time)
is on-chip shared memory. Shared memory enables threads within the same thread block to cooperate, facilitates extensive reuse of on-chip data, and greatly reduces off-chip traffic. Shared memory is a key enabler for many high-performance CUDA applications.
G80 and GT200 have 16 KB of shared memory per SM. In the Fermi architecture, each SM has 64 KB of on-chip memory that can be configured as 48 KB of shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache.
For existing applications that make extensive use of shared memory, tripling the amount of shared memory yields significant performance improvements, especially for problems that are bandwidth constrained. For existing applications that use shared memory as a software-managed cache, code can be streamlined to take advantage of the hardware caching system, while still having access to at least 16 KB of shared memory for explicit thread cooperation. Best of all, applications that do not use shared memory automatically benefit from the L1 cache, allowing high-performance CUDA programs to be built with minimum time and effort.
FIGURE 7: RADIX SORT USING SHARED MEMORY (when using 48 KB of shared memory on Fermi, Radix Sort executes 4.7x faster than GT200)
FIGURE 8: PHYSX FLUID COLLISION FOR CONVEX SHAPES (physics algorithms such as fluid simulations especially benefit from Fermi's caches; for convex shape collisions, Fermi is 2.7x faster than GT200)
NVIDIA, GeForce, Tesla, CUDA, PhysX, GigaThread, NVIDIA Parallel DataCache and certain other trademarks and logos appearing in this brochure are trademarks or registered trademarks of NVIDIA Corporation.
GPU                                     G80                 GT200               Fermi
Transistors                             681 million         1.4 billion         3.0 billion
CUDA Cores                              128                 240                 512
Double Precision Floating Point         None                30 FMA ops/clock    256 FMA ops/clock
Single Precision Floating Point         128 MAD ops/clock   240 MAD ops/clock   512 FMA ops/clock
Special Function Units (SFUs) per SM    2                   2                   4
Warp Schedulers (per SM)                1                   1                   2
Shared Memory (per SM)                  16 KB               16 KB               Configurable 48 KB or 16 KB
L1 Cache (per SM)                       None                None                Configurable 16 KB or 48 KB
L2 Cache                                None                None                768 KB
ECC Memory Support                      No                  No                  Yes
Concurrent Kernels                      No                  No                  Up to 16
Load/Store Address Width                32-bit              32-bit              64-bit
GIGATHREAD THREAD SCHEDULER
One of the most important technologies of the Fermi architecture is its two-level, distributed thread scheduler. At the chip level, a global work distribution engine schedules thread blocks to various SMs, while at the SM level, each warp scheduler distributes warps of 32 threads to its execution units. The first generation GigaThread engine introduced in G80 managed up to 12,288 threads in real time. The Fermi architecture improves on this foundation by providing not only greater thread throughput, but dramatically faster context switching, concurrent kernel execution, and improved thread block scheduling.
10x Faster Application Context Switching
Like CPUs, GPUs support multitasking through the use of context switching, where each program receives a time slice of the processor's resources. The Fermi pipeline is optimized to reduce the cost of an application context switch to below 25 microseconds, a significant improvement over last-generation GPUs. Besides improved performance, this allows developers to create applications that take greater advantage of frequent kernel-to-kernel communication, such as fine-grained interoperation between graphics and PhysX applications.
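The per-chip figures in the comparison table above are consistent with the per-SM numbers quoted earlier in this section (32 cores and 16 double precision FMA operations per SM per clock, across 16 SMs). A quick arithmetic check, written as illustrative constants rather than anything queried from hardware:

```c
#include <assert.h>

/* Fermi figures as quoted in this section (all per clock where noted). */
enum {
    FERMI_SMS     = 16,  /* streaming multiprocessors per GPU        */
    CORES_PER_SM  = 32,  /* CUDA cores per SM                        */
    DP_FMA_PER_SM = 16   /* double precision FMA ops per SM per clock */
};

int fermi_cuda_cores(void)       { return FERMI_SMS * CORES_PER_SM; }
int fermi_dp_fma_per_clock(void) { return FERMI_SMS * DP_FMA_PER_SM; }
```

Both products match the table: 512 CUDA cores and 256 double precision FMA operations per clock.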
Concurrent Kernel Execution
Fermi supports concurrent kernel execution, where different kernels of the same application context can execute on the GPU at the same time. Concurrent kernel execution allows programs that execute a number of small kernels to utilize the whole GPU. For example, a PhysX program may invoke a fluids solver and a rigid body solver which, if executed sequentially, would use only half of the available thread processors. On the Fermi architecture, different kernels of the same CUDA context can execute concurrently, allowing maximum utilization of GPU resources. Kernels from different application contexts can still run sequentially with great efficiency thanks to the improved context switching performance.
FIGURE 9: SERIAL KERNEL EXECUTION and FIGURE 10: CONCURRENT KERNEL EXECUTION (five kernels executing one after another over time, versus the same kernels overlapping in time)
INTRODUCING NVIDIA PARALLEL NSIGHT
NVIDIA Parallel Nsight is the first development environment designed specifically to support massively parallel CUDA C, OpenCL, and DirectCompute applications. It bridges the productivity gap between CPU and GPU code by bringing parallel-aware hardware source code debugging and performance analysis directly into Microsoft Visual Studio, the most widely used integrated application development environment under Microsoft Windows.
Parallel Nsight allows Visual Studio developers to write and debug GPU source code using exactly the same tools and interfaces that are used when writing and debugging CPU code, including source and data breakpoints, and memory inspection. Furthermore, Parallel Nsight extends Visual Studio functionality by offering tools to manage massive parallelism, such as the ability to focus on and debug a single thread out of the thousands of threads running in parallel, and the ability to simply and efficiently visualize the results computed by all parallel threads.
Parallel Nsight is the perfect environment to develop co-processing applications that take advantage of both the CPU and GPU. It captures performance events and information across both processors, and presents the information to the developer on a single correlated timeline. This allows developers to see how their application behaves and performs on the entire system, rather than through a narrow view that is focused on a particular subsystem or processor.
Parallel Nsight Debugger for GPU Computing
 Debug your CUDA C/C++ and DirectCompute source code directly on the GPU hardware
 As the industry's only GPU hardware debugging solution, it drastically increases debugging speed and accuracy
 Use the familiar Visual Studio Locals, Watches, Memory and Breakpoints windows
Parallel Nsight Analysis Tool for GPU Computing
 Isolate performance bottlenecks by viewing system-wide CPU+GPU events
 Support for all major GPU Computing APIs, including CUDA C/C++, OpenCL, and Microsoft DirectCompute
FIGURE 11: DEBUGGER FOR GPU COMPUTING
FIGURE 12: ANALYSIS TOOL FOR GPU COMPUTING
Parallel Nsight Debugger for Graphics Development
 Debug HLSL shaders directly on the GPU hardware, drastically increasing debugging speed and accuracy over emulated (SW) debugging
 Use the familiar Visual Studio Locals, Watches, Memory and Breakpoints windows with HLSL shaders, including DirectCompute code
 The Debugger supports all HLSL shader types: Vertex, Pixel, Geometry, and Tessellation
FIGURE 13: DEBUGGER FOR GRAPHICS DEVELOPMENT
Parallel Nsight Graphics Inspector for Graphics Development
 Graphics Inspector captures Direct3D rendered frames for real-time examination
 The Frame Profiler automatically identifies bottlenecks and performance information on a per-draw-call basis
 Pixel History shows you all operations that affected a given pixel
FIGURE 14: GRAPHICS INSPECTOR FOR GRAPHICS DEVELOPMENT
transtec has strived to develop well-engineered GPU Computing solutions from the very beginning of the Tesla era. From high-performance GPU workstations to rack-mounted Tesla server solutions, transtec has a broad range of specially designed systems available. As an NVIDIA Tesla Preferred Provider (TPP), transtec is able to provide customers with the latest NVIDIA GPU technology as well as fully engineered hybrid systems and Tesla Preconfigured Clusters. Thus, customers can be assured that transtec's large experience in HPC cluster solutions is seamlessly brought into the GPU computing world. Performance Engineering made by transtec.
QLOGIC TRUESCALE INFINIBAND AND GPUS

QLogic is a global leader and technology innovator in high performance networking, including adapters, switches and ASICs.

EXECUTIVE OVERVIEW
The High Performance Computing market's continuing need for improved time-to-solution and the ability to explore expanding models seems unquenchable, requiring ever-faster HPC clusters. This has led many HPC users to implement graphics processing units (GPUs) in their clusters. While GPUs have traditionally been used solely for visualization or animation, today they serve as fully programmable, massively parallel processors, allowing computing tasks to be divided and concurrently processed on the GPU's many processing cores. When multiple GPUs are integrated into an HPC cluster, the performance potential of the HPC cluster is greatly enhanced. This processing environment enables scientists and researchers to tackle some of the world's most challenging computational problems.

HPC applications modified to take advantage of the GPU processing capabilities can benefit from significant performance gains over clusters implemented with traditional processors. To obtain these results, HPC clusters with multiple GPUs require a high-performance interconnect to handle the GPU-to-GPU communications and optimize the overall performance potential of the GPUs. Because the GPUs place significant demands on the interconnect, it takes a high-performance interconnect, such as InfiniBand, to provide the low latency, high message rate, and bandwidth that are needed to enable all resources in the cluster to run at peak performance.
QLogic worked in concert with NVIDIA to optimize QLogic TrueScale InfiniBand with NVIDIA GPU technologies. This solution supports the full performance potential of NVIDIA GPUs through an interface that is easy to deploy and maintain.

Key Points
- Up to 44 percent GPU performance improvement versus an implementation without GPUDirect – a GPU computing product from NVIDIA that enables faster communication between the GPU and InfiniBand
- QLogic TrueScale InfiniBand offers as much as 10 percent better GPU performance than other InfiniBand interconnects
- Ease of installation and maintenance – QLogic's implementation offers a streamlined deployment approach that is significantly easier than alternatives

EASE OF DEPLOYMENT
One of the key challenges with deploying clusters consisting of multi-GPU nodes is to maximize application performance. Without GPUDirect, GPU-to-GPU communications would require the host CPU to make multiple memory copies to avoid a memory pinning conflict between the GPU and InfiniBand. Each additional CPU memory copy significantly reduces the performance potential of the GPUs.

QLogic's implementation of GPUDirect takes a streamlined approach to optimizing NVIDIA GPU performance with QLogic TrueScale InfiniBand. With QLogic's solution, a user only needs to update the NVIDIA driver with code provided and tested by QLogic. Other InfiniBand implementations require the user to implement a Linux kernel patch as well as a special InfiniBand driver. The QLogic approach provides a much easier way to deploy, support, and maintain GPUs in a cluster without having to sacrifice performance. In addition, it is completely compatible with other GPUDirect implementations; the CUDA libraries and application code require no changes.

OPTIMIZED PERFORMANCE
QLogic used AMBER molecular dynamics simulation software to test clustered GPU performance with and without GPUDirect. Figure 15 shows that there is a significant performance gain of up to 44 percent that results from streamlining the host memory access to support GPU-to-GPU communications.

FIGURE 15 PERFORMANCE WITH AND WITHOUT GPUDIRECT (AMBER Cellulose test, ns/day on 8 GPUs: up to 44 percent better with TrueScale and GPUDirect than without GPUDirect)

CLUSTERED GPU PERFORMANCE
HPC applications that have been designed to take advantage of parallel GPU performance require a high-performance interconnect, such as InfiniBand, to maximize that performance. In addition, the implementation or architecture of the InfiniBand interconnect can impact performance. The two industry-leading InfiniBand implementations have very different architectures, and only one was specifically designed for the HPC market – QLogic's TrueScale InfiniBand. TrueScale InfiniBand provides unmatched
performance benefits, especially as the GPU cluster is scaled. It offers high performance in all of the key areas that influence the performance of HPC applications, including GPU-based applications. These factors include the following:
- Scalable non-coalesced message rate performance greater than 25M messages per second
- Extremely low latency for MPI collectives, even on clusters consisting of thousands of nodes
- Consistently low latency of one to two µs, even at scale

These factors and the design of QLogic TrueScale InfiniBand enable it to optimize the performance of NVIDIA GPUs. The following tests were performed on NVIDIA Tesla 2050s interconnected with QLogic TrueScale QDR InfiniBand at QLogic's NETtrack Developer Center. The Tesla 2050 results for the industry's other leading InfiniBand are from the published results on the AMBER benchmark site.

Figure 16 shows performance results from the AMBER Myoglobin benchmark (2,492 atoms) when scaling from two to eight Tesla 2050 GPUs. The results indicate that QLogic TrueScale InfiniBand offers up to 10 percent more performance than the industry's other leading InfiniBand when both used their versions of GPUDirect. As the figure shows, the performance difference increases as the application is scaled to more GPUs.

FIGURE 16 GPU SCALABLE PERFORMANCE WITH THE INDUSTRY'S LEADING INFINIBANDS (AMBER Myoglobin test, ns/day on 2, 4 and 8 GPUs: even at 2 GPUs, TrueScale 3.2 percent better at 4 GPUs and 9.6 percent better at 8 GPUs)

The next test (Figure 17) shows the impact of the InfiniBand interconnect on the performance of AMBER across models of various sizes. The following Explicit Solvent models were tested:
- DHFR: 23,558 atoms
- FactorIX: 90,906 atoms
- Cellulose: 408,609 atoms

FIGURE 17 EXPLICIT SOLVENT BENCHMARK RESULTS FOR THE TWO LEADING INFINIBANDS (8 GPUs, ns/day on DHFR, FactorIX and Cellulose: TrueScale versus the other leading InfiniBand)
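The model sizes listed above directly determine how sensitive each benchmark is to the interconnect: the atoms are divided across the GPUs, so a small model leaves each GPU with little computation per simulation step relative to the communication it must do. A quick back-of-the-envelope sketch in Python (atom counts come from the list above; the even-split assumption and helper names are ours):

```python
import math

# Atom counts from the AMBER Explicit Solvent benchmark list above
MODELS = {"DHFR": 23558, "FactorIX": 90906, "Cellulose": 408609}

def atoms_per_gpu(model: str, gpus: int = 8) -> int:
    """Assume an even domain decomposition of the model across the GPUs."""
    return math.ceil(MODELS[model] / gpus)

for name in MODELS:
    print(f"{name}: {atoms_per_gpu(name)} atoms per GPU on 8 GPUs")

# The ratio explains why Cellulose is compute-bound while DHFR is
# communication-bound: each GPU gets roughly 17x more work per step.
print(round(MODELS["Cellulose"] / MODELS["DHFR"]))  # -> 17
```

On 8 GPUs, DHFR leaves each Tesla 2050 with only about 2,945 atoms per step, which is why the communication time dominates for the small model.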
It is important to point out that the performance of the models is dependent on the model size, the size of the GPU cluster, and the performance of the InfiniBand interconnect. The smaller the model, the more it is dependent on the interconnect, due to the fact that the model's components (atoms in the case of AMBER) are divided across the available GPUs in the cluster to be processed for each step of the simulation. For example, the DHFR test with its 23,558 atoms means that each Tesla 2050 in an eight-GPU cluster is processing only 2,945 atoms for each step of the simulation. The processing time is relatively small when compared to the communication time. In contrast, the Cellulose model with its 408K atoms requires each GPU to process 17 times more data per step than the DHFR test, so significantly more time is spent in GPU processing than in communications.

The preceding tests demonstrate that the TrueScale InfiniBand performs better under load. The DHFR model is the most sensitive to the interconnect performance, and it indicates that TrueScale offers six percent more performance than the alternative InfiniBand product. Combining the results from Figure 15 and Figure 16 illustrates that TrueScale InfiniBand provides better results with smaller models on small clusters and better model scalability for larger models on larger GPU clusters.

PERFORMANCE/WATT ADVANTAGE
Today the focus is not just on performance, but on how efficiently that performance can be delivered. This is an area in which QLogic TrueScale InfiniBand excels. The National Center for Supercomputing Applications (NCSA) has a cluster based on NVIDIA GPUs interconnected with TrueScale InfiniBand. This cluster is number three on the November 2010 Green500 list with a performance of 933 MFlops/Watt.

This on its own is a significant accomplishment, but it is even more impressive when considering its original position on the SuperComputing Top500 list. In fact, the cluster is ranked at #404 on the Top500 list, but the combination of NVIDIA's GPU performance, QLogic's TrueScale performance, and low power consumption enabled the cluster to move up 401 spots from the Top500 list to reach number three on the Green500 list. This is the most dramatic shift of any cluster in the top 50 of the Green500. In part, the following are the reasons for such dramatic performance/watt results:
- Performance of the NVIDIA Tesla 2050 GPU
- Linpack performance efficiency of this cluster is 49 percent, which is almost 20 percent better than most other NVIDIA GPU-based clusters on the Top500 list
- The QLogic TrueScale InfiniBand Adapter required 25–50 percent less power than the alternative InfiniBand product

CONCLUSION
The performance of the InfiniBand interconnect has a significant impact on the performance of GPU-based clusters. QLogic's TrueScale InfiniBand is designed and architected for the HPC marketplace, and it offers an unmatched performance profile with a GPU-based cluster. Finally, QLogic's solution provides an implementation that is easier to deploy and maintain, and allows for optimal performance in comparison to the industry's other leading InfiniBand.
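The Green500 and efficiency figures above reduce to two simple ratios: sustained Linpack performance per watt of total power, and sustained versus theoretical peak performance. A sketch of the arithmetic (the rmax/rpeak/power inputs are illustrative values we chose so the ratios match the published 49 percent and 933 MFlops/Watt figures; they are not NCSA's actual submission data):

```python
def linpack_efficiency(rmax_gflops: float, rpeak_gflops: float) -> float:
    """Fraction of theoretical peak achieved by the Linpack run."""
    return rmax_gflops / rpeak_gflops

def green500_score(rmax_gflops: float, power_kw: float) -> float:
    """MFlops per Watt, the Green500 ranking metric."""
    return rmax_gflops * 1000 / (power_kw * 1000)

# Illustrative cluster: 49 TFlops sustained out of 100 TFlops peak, 52.52 kW draw
print(f"efficiency: {linpack_efficiency(49_000, 100_000):.0%}")
print(f"{green500_score(49_000, 52.52):.0f} MFlops/Watt")
```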
InfiniBand (IB) is an efficient I/O technology that provides high-speed data transfers and ultra-low latencies for computing and storage over a highly reliable and scalable single fabric. The InfiniBand industry standard ecosystem creates cost-effective hardware and software solutions that easily scale from generation to generation.

InfiniBand is a high-bandwidth, low-latency network interconnect solution that has gained tremendous market share in the High Performance Computing (HPC) cluster community. InfiniBand was designed to take the place of today's data center networking technology. In the late 1990s, a group of next-generation I/O architects formed an open, community-driven network technology to provide scalability and stability based on successes from other network designs. Today, InfiniBand is a popular and widely used I/O fabric among customers within the Top500 supercomputers: major Universities and Labs; Life Sciences; Biomedical; Oil and Gas (Seismic, Reservoir, Modeling Applications); Computer Aided Design and Engineering; Enterprise Oracle; and Financial Applications.
INFINIBAND
HIGH-SPEED INTERCONNECTS

InfiniBand was designed to meet the evolving needs of the high performance computing market. Computational science depends on InfiniBand to deliver:
- High Bandwidth: Supports host connectivity of 10Gbps with Single Data Rate (SDR), 20Gbps with Double Data Rate (DDR), and 40Gbps with Quad Data Rate (QDR), all while offering an 80Gbps switch for link switching
- Low Latency: Accelerates the performance of HPC and enterprise computing applications by providing ultra-low latencies
- Superior Cluster Scaling: Point-to-point latency remains low as node and core counts scale – 1.2 µs. Highest real message rate per adapter: each PCIe x16 adapter drives 26 million messages per second. Excellent communications/computation overlap among nodes in a cluster
- High Efficiency: InfiniBand allows reliable protocols like Remote Direct Memory Access (RDMA) communication to occur between interconnected hosts, thereby increasing efficiency
- Fabric Consolidation and Energy Savings: InfiniBand can consolidate networking, clustering, and storage data over a single fabric, which significantly lowers overall power, real estate, and management overhead in data centers. Enhanced Quality of Service (QoS) capabilities support running and managing multiple workloads and traffic classes
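The SDR/DDR/QDR figures above are raw signaling rates for a standard 4x link; since these InfiniBand generations use 8b/10b encoding on the wire, only 80 percent of the raw rate carries payload data. A small sketch of that arithmetic (the per-lane rates are standard InfiniBand values; the helper function is ours):

```python
# Per-lane signaling rates in Gbps for each InfiniBand generation
LANE_GBPS = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}

def link_rates(generation: str, lanes: int = 4) -> tuple:
    """Return (raw, effective) Gbps; 8b/10b leaves 8 data bits per 10 line bits."""
    raw = LANE_GBPS[generation] * lanes
    return raw, raw * 8 / 10

for gen in LANE_GBPS:
    raw, eff = link_rates(gen)
    print(f"{gen} 4x link: {raw:.0f} Gbps raw, {eff:.0f} Gbps of data")
```

So a QDR 4x link signals at 40Gbps but delivers 32Gbps of application data before protocol headers are accounted for.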
- Data Integrity and Reliability: InfiniBand provides the highest levels of data integrity by performing Cyclic Redundancy Checks (CRCs) at each fabric hop and end-to-end across the fabric to avoid data corruption. To meet the needs of mission critical applications and high levels of availability, InfiniBand provides fully redundant and lossless I/O fabrics with automatic failover paths and link-layer multi-pathing

COMPONENTS OF THE INFINIBAND FABRIC
InfiniBand is a point-to-point, switched I/O fabric architecture. Point-to-point means that each communication link extends between only two devices. Both devices at each end of a link have full and exclusive access to the communication path. To go beyond a point and traverse the network, switches come into play. By adding switches, multiple points can be interconnected to create a fabric. As more switches are added to a network, the aggregated bandwidth of the fabric increases. By adding multiple paths between devices, switches also provide a greater level of redundancy.

The InfiniBand fabric has four primary components, which are explained in the following sections:
- Host Channel Adapter
- Target Channel Adapter
- Switch
- Subnet Manager

FIGURE 1 TYPICAL INFINIBAND HIGH PERFORMANCE CLUSTER

Host Channel Adapter
This adapter is an interface that resides within a server and communicates directly with the server's memory and processor as well as the InfiniBand Architecture (IBA) fabric. The adapter guarantees delivery of data, performs advanced memory access, and can recover from transmission errors. Host channel adapters can communicate with a target channel adapter or a switch. A host channel adapter can be a standalone InfiniBand card or it can be integrated on a system motherboard. QLogic TrueScale InfiniBand host channel adapters outperform the competition with the industry's highest message rate. Combined with the lowest MPI latency and highest effective bandwidth, QLogic host channel adapters enable MPI and TCP applications to scale to thousands of nodes with unprecedented price performance.

Target Channel Adapter
This adapter enables I/O devices, such as disk or tape storage, to be located within the network independent of a host computer. Target channel adapters include an I/O controller that is specific to its particular device's protocol (for example, SCSI, Fibre Channel (FC), or Ethernet). Target channel adapters can communicate with a host channel adapter or a switch.

Switch
An InfiniBand switch allows many host channel adapters and target channel adapters to connect to it and handles network traffic. The switch looks at the "local route header" on each packet of data that passes through it and forwards it to the appropriate location. The switch is a critical component of the InfiniBand implementation that offers higher availability, higher aggregate bandwidth, load balancing, data mirroring, and much more. A group of switches is referred to as a fabric. If a host computer is down, the switch still continues to operate. The switch also frees up servers and other devices by handling network traffic.

The QLogic TrueScale 12000 family of Multi-Protocol Fabric Directors is the most highly integrated cluster computing interconnect solution available. An ideal solution for HPC, database clustering, and grid utility computing applications, the 12000 Fabric Directors maximize cluster and grid computing interconnect performance while simplifying and reducing the cost of operating a data center.

Subnet Manager
The subnet manager is an application responsible for configuring the local subnet and ensuring its continued operation. Configuration responsibilities include managing switch setup and reconfiguring the subnet if a link goes down or a new one is added.
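The switch behavior described above – read the local route header, forward the packet out of the matching port – can be pictured as nothing more than a lookup table keyed by the destination address. A toy model (the class, LID values and packet layout are invented for illustration; in a real fabric the subnet manager programs the forwarding tables):

```python
class ToySwitch:
    """Forwards packets by destination LID, akin to an IB linear forwarding table."""

    def __init__(self):
        self.lft = {}  # destination LID -> output port

    def program(self, dlid: int, port: int):
        """In a real fabric, the subnet manager fills in these entries."""
        self.lft[dlid] = port

    def forward(self, packet: dict) -> int:
        # The "local route header" carries the destination LID used for the lookup
        return self.lft[packet["dlid"]]

sw = ToySwitch()
sw.program(dlid=5, port=1)
sw.program(dlid=9, port=3)
print(sw.forward({"dlid": 9, "payload": b"hello"}))  # -> 3
```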
HOW HIGH PERFORMANCE COMPUTING HELPS VERTICAL APPLICATIONS
Enterprises that want to do high performance computing must balance the following scalability metrics as the core size and the number of cores per node increase:
- Latency and Message Rate Scalability: must allow near-linear growth in productivity as the number of compute cores is scaled.
- Power and Cooling Efficiency: as the cluster is scaled, power and cooling requirements must not become major concerns in today's world of energy shortages and high-cost energy.

InfiniBand offers the promise of low latency, high bandwidth, and unmatched scalability demanded by high performance computing applications. IB adapters and switches that perform well on these key metrics allow enterprises to meet their high-performance and MPI needs with optimal efficiency. The IB solution allows enterprises to quickly achieve their compute and business goals.

Vertical Market | Application Segment | InfiniBand Value
Oil & Gas | Mix of Independent Service Provider (ISP) and home grown codes: reservoir modeling | Low latency, high bandwidth
Computer Aided Engineering (CAE) | Mostly Independent Software Vendor (ISV) codes: crash, air flow, and fluid flow simulations | High message rate, low latency, scalability
Government | Home grown codes: labs, defense, weather, and a wide range of apps | High message rate, low latency, scalability, high bandwidth
Education | Home grown and open source codes: a wide range of apps | High message rate, low latency, scalability, high bandwidth
Financial | Mix of ISP and home grown codes: market simulation and trading floor | High performance IP, scalability, high bandwidth
Life and Materials Science | Mostly ISV codes: molecular simulation, computational chemistry, and biology apps | Low latency, high message rates
TOP 10 REASONS TO USE QLOGIC TRUESCALE INFINIBAND

1. Predictable Low Latency Under Load – Less Than 1.0 µs. TrueScale is designed to make the most of multi-core nodes by providing ultra-low latency and significant message rate scalability. As additional compute resources are added to the QLogic TrueScale InfiniBand solution, latency and message rates scale linearly. HPC applications can be scaled without having to worry about diminished utilization of compute resources.

2. Quad Data Rate (QDR) Performance. The QLogic 12000 switch family runs at lane speeds of 10Gbps, providing a full bi-sectional bandwidth of 40Gbps (QDR). In addition, the 12000 switch has the unique capability of riding through periods of congestion with features such as deterministically low latency. The QLogic family of TrueScale 12000 products offers the lowest latency of any IB switch and high performance transfers with the industry's most robust signal integrity.

3. Flexible QoS Maximizes Bandwidth Use. The QLogic 12800 advanced design is based on an architecture that provides comprehensive virtual fabric partitioning capabilities that enable the IB fabric to support the evolving requirements of an organization.

4. Unmatched Scalability – 18 to 864 Ports per Switch. QLogic offers the broadest portfolio (five chassis and two edge switches) from 18 to 864 TrueScale InfiniBand ports, allowing customers to buy switches that match their connectivity, space, and power requirements.

5. Highly Reliable and Available. Reliability And Serviceability (RAS) that is proven in the most demanding Top500 and Enterprise environments is designed into QLogic's 12000 series with hot-swappable components, redundant components, customer-replaceable units, and non-disruptive code load.

6. Lowest Per-Port Power and Cooling Requirements. The TrueScale 12000 offers the lowest power consumption and the highest port density – 864 total TrueScale InfiniBand ports in a single chassis makes it unmatched in the industry. This results in delivering the lowest power per port for a director switch (7.8 watts per port) and the lowest power per port for an edge switch (3.3 watts per port).

7. Easy to Install and Manage. QLogic installation, configuration, and monitoring Wizards reduce time-to-ready. The QLogic InfiniBand Fabric Suite (IFS) assists in diagnosing problems in the fabric. Non-disruptive firmware upgrades provide maximum availability and operational simplicity.

8. Protects Existing InfiniBand Investments. Seamless virtual I/O integration at the Operating System (OS) and application levels matches standard network interface card and adapter semantics with no OS or app changes – they just work. Additionally, the TrueScale family of products is compliant with the InfiniBand Trade Association (IBTA) open specification, so QLogic products inter-operate with any IBTA-compliant InfiniBand vendor. Being IBTA-compliant makes the QLogic 12000 family of switches ideal for network consolidation, for sharing and scaling I/O pools across servers, and for pooling and sharing I/O resources between servers.

9. Modular Configuration Flexibility. The QLogic 12000 series switches offer configuration and scalability flexibility that meets the requirements of either a high-density or high-performance compute grid by offering port modules that address both needs. Units can be populated with Ultra High Density (UHD) leafs for maximum connectivity or Ultra High Performance (UHP) leafs for maximum performance. The high-scalability 24-port leaf modules support configurations between 18 and 864 ports, providing the right size to start and the capability to grow as your grid grows.

10. Option to Gateway to Ethernet and Fibre Channel Networks. QLogic offers multiple options to enable hosts on InfiniBand fabrics to transparently access Fibre Channel based storage area networks (SANs) or Ethernet based local area networks (LANs).

QLogic TrueScale architecture and the resulting family of products deliver the promise of InfiniBand to the enterprise today.

"Effective fabric management has become the most important factor in maximizing performance in an HPC cluster. With IFS 6.0, QLogic has addressed all of the major fabric management issues in a product that in many ways goes beyond what others are offering."
Michael Wirth, HPC Presales Specialist
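The watts-per-port figures in reason 6 multiply straightforwardly into a chassis power budget. A quick sketch (the 7.8 W and 3.3 W per-port values come from the text; the 36-port edge chassis size is our hypothetical example):

```python
def chassis_power_watts(ports: int, watts_per_port: float) -> float:
    """Total switch power as port count times per-port consumption."""
    return ports * watts_per_port

# Fully populated 864-port director at 7.8 W/port: roughly 6.7 kW
print(round(chassis_power_watts(864, 7.8)))
# Hypothetical 36-port edge switch at 3.3 W/port: roughly 119 W
print(round(chassis_power_watts(36, 3.3)))
```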
INTEL MPI LIBRARY 4.0 PERFORMANCE

INTRODUCTION
Intel's latest MPI release, Intel MPI 4.0, is now optimized to work with QLogic's TrueScale InfiniBand adapter. Intel MPI 4.0 can now directly call QLogic's TrueScale Performance Scaled Messaging (PSM) interface. The PSM interface is designed to optimize MPI application performance. This means that organizations will be able to achieve a significant performance boost with a combination of Intel MPI 4.0 and QLogic TrueScale InfiniBand.

FIGURE 2 (test configuration)
Purpose: Compare the performance of Intel MPI Library 4.0 and 3.1 with QLogic TrueScale InfiniBand
Benchmark: PUMA Flow
Cluster: QLogic/IBM iDataPlex cluster
System Configuration: the NETtrack IBM Q-Blue Cluster/iDataPlex nodes were configured as follows:
Processor: Intel Xeon CPU X5570 @ 2.93 GHz
Memory: 24GB (6x4GB) @ 1333MHz (DDR3)
QDR InfiniBand Switch: QLogic Model 12300/firmware version
QDR InfiniBand Host Channel Adapter: QLogic QLE7340 software stack
Operating System: Red Hat Enterprise Linux Server release 5.3
Kernel: 2.6.18-128.el5
File System: IFS Mounted

FIGURE 3 (elapsed time over 16, 32 and 64 cores: Intel MPI Library 4.0 shows a 35 percent improvement over MPI Library 3.1)

SOLUTION
QLogic worked with Intel to tune and optimize the company's latest MPI release – Intel MPI Library 4.0 – to improve performance when used with QLogic TrueScale InfiniBand. With MPI Library 4.0, applications can make full use of High Performance Computing (HPC) hardware, improving the overall performance of the applications on the clusters.

INTEL MPI LIBRARY 4.0
Intel MPI Library 4.0 implements the high performance MPI-2 specification on multiple fabrics, which results in better performance for applications on Intel architecture-based clusters. This library enables quick delivery of maximum end-user performance, even if there are changes or upgrades to new interconnects – without requiring major changes to the software or operating environment. With this high-performance message-passing interface library, developers can build applications that run on multiple cluster fabric interconnects chosen by the user at runtime.

Testing
QLogic used the Parallel Unstructured Maritime Aerodynamics (PUMA) benchmark program to test the performance of Intel MPI Library versions 4.0 and 3.1 with QLogic TrueScale InfiniBand. The program analyses internal and external non-reacting compressible flows over arbitrarily complex 3D geometries. PUMA is written in ANSI C and uses MPI libraries for message passing.

Results
The test showed that MPI performance can improve by more than 35 percent using Intel MPI Library 4.0 with QLogic TrueScale InfiniBand, compared to using Intel MPI Library 3.1.

QLOGIC TRUESCALE INFINIBAND ADAPTERS
QLogic TrueScale InfiniBand Adapters offer scalable performance, reliability, low power consumption, and superior application performance. These adapters ensure superior performance of HPC applications by delivering the highest message rate for multicore compute nodes, the lowest scalable latency for large node-count clusters, the highest overall bandwidth on PCI Express Gen1 platforms, and superior power efficiency.
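The "more than 35 percent" result is a relative reduction in elapsed time. A sketch of the calculation (the elapsed-time inputs below are hypothetical numbers chosen only to illustrate the formula, not the measured PUMA timings, which the text reports only as a percentage):

```python
def improvement_pct(t_before: float, t_after: float) -> float:
    """Relative reduction in elapsed time, in percent."""
    return (t_before - t_after) / t_before * 100

# Hypothetical elapsed times (seconds) for a 64-core PUMA run
t_mpi31, t_mpi40 = 1000.0, 650.0
print(f"{improvement_pct(t_mpi31, t_mpi40):.0f}% faster with Intel MPI 4.0")
```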
INFINIBAND FABRIC SUITE (IFS) – WHAT'S NEW IN VERSION 6.0

EFFICIENT PERFORMANCE
- Industry's lowest end-to-end latency
- World-record message rate performance
- The only solution with true linear scaling of core count

EFFICIENT MANAGEMENT
- Automation to accelerate installations and upgrades
- Detection and diagnosis of fabric issues in seconds
- Initialization of 2,000-node clusters in seconds

EFFICIENT CAPACITY
- 50 percent higher node count in the same footprint
- Up to 67 percent improvement in bandwidth utilization
- Better performance with less hardware

FIGURE 4 INFINIBAND FABRIC SUITE 6.0 COMPONENTS (Fabric Manager, Advanced Fabric Services, FastFabric Tools, Fabric Viewer, Host Services, Hosts)

QLOGIC'S TRUESCALE INFINIBAND SYSTEM ARCHITECTURE DELIVERS END-TO-END EFFICIENCY
- Hardened Subnet Manager minimizes impact of fabric disruptions
- Complete set of FastFabric Tools maximizes system up-time
- Fabric Viewer provides insights about network performance
- Dispersive routing improves performance for all message passing interfaces (MPIs)
- Extreme message rate delivers unmatched application performance
- Comprehensive set of Host Services software tools quickly optimizes High Performance Computing (HPC) environments
- Cost-effectively scale node counts with Advanced Topologies support
- Virtual Fabrics with quality of service (QoS) ensure consistent application performance
- Fabric intelligence through hardware-enabled Adaptive Routing circumvents congestion bottlenecks

QLOGIC INFINIBAND FABRIC SUITE 6.0
QLogic leveraged its unique, system-level understanding of communications fabrics to deliver the industry's most powerful feature set available for fabric management software: QLogic InfiniBand Fabric Suite (IFS) 6.0. IFS enables users to obtain the highest fabric performance, the greatest communications efficiency, and the lowest management costs for HPC clusters of any size.

IFS 6.0 components include:
- Fabric Manager
- Advanced Fabric Services
- FastFabric Toolset
- Fabric Viewer
- Host Services

The InfiniBand Fabric Suite architecture is modular: both OpenFabrics Enterprise Distribution (OFED) and QLogic-developed software modules can coexist in the same fabric. Major components of the architecture include:

Fabric Manager. Offers administrative functions for subnet, InfiniBand fabric, and individual component management through HTML and a Java-based console.

Advanced Fabric Services. Delivers efficient fabric performance with industry-leading technologies to automatically tune and optimize an organization's network.

FastFabric Toolset. Includes easy-to-use tools that ensure effortless installation, configuration, and verification of the cluster network and virtual gateway I/O resources.

Fabric Viewer. Displays the Fabric Manager subnet management facilities using a Java-based, stand-alone GUI.

Host Services. Provides a wide range of drivers and installers for products from all adapter vendors, as well as optimized MPI user tools.

It takes a comprehensive and powerful end-to-end software solution to bring the full power of InfiniBand to an organization's business applications. QLogic provides all the features needed for a high-performance fabric with QLogic InfiniBand Fabric Suite 6.0.
WHAT'S NEW IN QLOGIC IFS 6.0
- Virtual Fabrics. Dedicate virtual lanes within the fabric to enable the highest utilization of compute resources by efficiently prioritizing and segmenting traffic.
- Adaptive Routing. Eliminate performance slowdowns caused by pathway bottlenecks. Only QLogic builds this intelligence into the chips. Changes occur in microseconds rather than minutes. The intelligence of the path selection scales as the fabric grows.
- Dispersive Routing. Load-balance traffic along multiple pathways and improve MPI performance. QLogic continues to set and extend the standard with world-record message rate performance.
- Mesh and Torus Topologies. Build extremely large fabrics more cost-efficiently. To ensure maximum performance on all nodes, multiple routing algorithms find the most efficient route and eliminate congestion.
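The idea behind Dispersive Routing – spread the traffic between a single pair of endpoints over several fabric paths instead of pinning it to one – can be modeled with a deterministic hash over a per-flow key. This toy sketch illustrates the concept only; the path names and hashing scheme are our invention, not QLogic's actual path-selection algorithm:

```python
import hashlib

PATHS = ["path-A", "path-B", "path-C", "path-D"]

def pick_path(src: int, dst: int, flow_id: int) -> str:
    """Deterministically spread flows between the same endpoint pair over all paths."""
    key = f"{src}:{dst}:{flow_id}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return PATHS[digest % len(PATHS)]

# Many flows between the same node pair end up load-balanced over every path
used = {pick_path(1, 2, flow) for flow in range(1000)}
print(sorted(used))
```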
transtec HPC solutions excel through their easy management and high usability, while maintaining high performance and quality throughout the whole lifetime of the system. As clusters scale, issues like congestion mitigation and Quality of Service can make a big difference in whether the fabric performs up to its full potential.

With the intelligent choice of QLogic InfiniBand products, transtec remains true to combining the best components to provide the full best-of-breed solution stack to the customer.

transtec HPC engineering experts are always available to fine-tune customers' HPC cluster systems and InfiniBand fabrics to get the maximum performance while at the same time providing them with an easy-to-manage and easy-to-use HPC solution.
The amount of digital data resources has exploded in the past decade. The 2010 IDC Digital Universe Study shows that, compared to 2008, the digital universe had grown by 62 percent to roughly 800,000 petabytes (0.8 zettabytes) in 2009. By the year 2020, IDC predicts the amount of data to be 44 times as big as it was in 2009, thus reaching approximately 35 zettabytes. As a result of the exploding amount of data, the demand for applications used for searching and analyzing large datasets will grow significantly in all industries.

However, today's applications require large and expensive infrastructures which consume vast amounts of energy and resources, creating challenges in the analysis of mass data. In the future, increasingly more terabyte-scale datasets will be used for research, analysis and diagnosis, bringing about further difficulties.
PARSTREAM
BIG DATA ANALYTICS

WHAT IS BIG DATA
The amount of generated and stored data is growing exponentially in unprecedented proportions. According to Eric Schmidt, former CEO of Google, as much data is now produced in two days as was created in total from the beginning of civilization until 2003.

Large amounts of data are produced in many areas of business and private life. For example, 200 million Twitter messages are generated – daily. Mobile devices, sensor networks, machine-to-machine communications, RFID readers, software logs, cameras and microphones generate a continuous stream of data of multiple terabytes per day.

[Figure 1: Big Data analytics positioned by data volume (gigabytes to petabytes) and lag time – from complex event processing (< 1–10 ms) and in-memory databases (10–100 ms) through interactive/operational analytics (around 1 s), OLTP and reporting (1–10 s), to batch analytics with MapReduce (> 10 min)]

For organizations, Big Data means a great opportunity and a big challenge. Companies that know how to manage Big Data will create more comprehensive predictions for the market and business. Decision-makers can quickly and safely respond to developments in dynamic and volatile markets. It is important that companies know how to use Big Data to improve their competitive position, increase productivity and develop new business models. According to Roger Magoulas of O'Reilly, the ability to analyze Big Data has developed into a core competence in the information age and can provide an enormous competitive advantage for companies.

[Figure 2: The ParStream real-time analytics engine – SQL API/JDBC/ODBC and C++ UDF APIs on top of in-memory and disc technology, multi-dimensional partitioning, the High Performance Compressed Index (HPCI), massively parallel processing (MPP) on a shared-nothing architecture, fast hybrid columnar/row storage, and a high-speed loader with low latency]

Big Data analytics has a huge impact on traditional data processing and causes business and IT managers to face new problems. The latest IT technologies and processes are poorly suited to analyzing very large amounts of data, according to a recent study by Gartner. IT departments often try to meet the increasing flood of information by traditional means, keeping the same infrastructure and simply purchasing more hardware. As
data volumes grow faster than the processing capabilities of additional servers or better processors, new technological approaches are needed.

SETBACKS IN CURRENT DATABASE ARCHITECTURES
Current databases are not engineered for mass data, but rather for small data volumes of up to 100 million records. Today's databases use outdated, 20–30 year old architectures, and their data and index structures are not constructed for efficient analysis of such data volumes. And, because these databases employ sequential algorithms, they are not able to exploit the potential of parallel hardware.

Algorithmic procedures for indexing large amounts of data have seen relatively few innovations in the past years and decades. Due to the ever-growing amounts of data to be processed, there are rising challenges that traditional database systems cannot cope with. Currently, new and innovative approaches to solve these problems are being developed and evaluated. However, some of these approaches seem to be heading in the wrong direction.

PARSTREAM – BIG DATA ANALYTICS PLATFORM
ParStream offers a revolutionary approach to high-performance data analysis. It addresses the problems arising from rapidly increasing data volumes in modern business, as well as scientific application scenarios.

ParStream
 has a unique index technology
 comes with efficient parallel processing
 is ultra-fast, even with billions of records
 scales linearly up to petabytes
 offers real-time analysis and continuous import
 is cost and energy efficient

ParStream uses a unique indexing technology which enables efficient multi-threaded processing on parallel architectures. Hardware and energy costs are substantially reduced, while overall performance is optimized to index and execute queries. Close-to-real-time analysis is obtained through simultaneous importing, indexing and querying of data. The developers of ParStream strive to continuously improve the product by working closely with universities, research partners, and clients.

ParStream is the right choice when...
 extreme amounts of data are to be searched and filtered
 filters use many columns in various combinations (ad-hoc queries)
 complex queries are performed frequently
 datasets are continuously growing
 close-to-real-time analytical results are expected
 infrastructure and operating costs need to be optimized

KEY BUSINESS SCENARIOS
Big Data is a game changer in all industries and the public sector. ParStream has identified many applications across all sectors that yield high returns for the customer. Examples are ad-spending analytics and social media monitoring in eCommerce, fraud detection and algorithmic trading in finance/capital markets, network quality monitoring and customer targeting in telcos, smart metering and smart grids in utilities, and many more.
TECHNOLOGY
The technical architecture of the ParStream database can be roughly divided into the following areas:

During the extract, transform and load (ETL) process, the supplied data is read and processed so that it can be forwarded to the actual loader. In the next step, ParStream generates and stores the necessary index structures and data representations. Input data can be stored in row- and/or column-oriented data stores. Here, configurable optimizations regarding sorting, compression and partitioning are accounted for.

[Figure 3: The ParStream pipeline – data sources feed ETL processes into the high-speed loader (import and indexing with compression, caching, partitioning and parallelization); row- and column-oriented record stores and the index store serve the query logic, which returns results; the pipeline runs across multiple servers, CPUs and GPUs]

Once data is loaded into the server, the user can pose queries over a standard interface (SQL) to the database engine. A parser then interprets the queries, and a query executor executes them in parallel. The optimizer is used to generate the initial query plan starting from the declarative request. This initial execution plan is used to start processing; however, it can be altered dynamically during runtime, e.g. by changing the level of parallelism. The query executor also makes optimal use of the available infrastructure.

All available parts of the query exploit bitmap information where possible. Logical filter conditions and aggregations can thus be calculated and forwarded to the client by highly efficient bitmap operations.
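ParStream's actual index format is proprietary, but the effect of bitmap processing described above can be sketched with plain bitwise operations: each filter condition on a column becomes a bitmap with one bit per record, combining conditions is a bitwise AND, and a COUNT aggregation is a population count over the result. This is an illustrative sketch only, not ParStream's implementation:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One bit per record: bit i is set if record i satisfies the condition.
using Bitmap = std::vector<uint64_t>;

// Combine two filter conditions (logical AND): one bitwise
// instruction covers 64 records at a time.
Bitmap bitmap_and(const Bitmap& a, const Bitmap& b) {
    Bitmap out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = a[i] & b[i];
    return out;
}

// COUNT(*) over the filtered rows is a population count -
// no access to the actual row data is needed.
std::size_t bitmap_count(const Bitmap& bm) {
    std::size_t n = 0;
    for (uint64_t w : bm)
        while (w) { w &= w - 1; ++n; }  // Kernighan popcount
    return n;
}
```

Because both the filter and the aggregation stay inside the index, such operations parallelize trivially across data partitions.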
Index Structure
ParStream's innovative index structure enables efficient parallel processing, allowing for unsurpassed levels of performance. Key features of the underlying database engine include:
 A column-oriented bitmap index using a unique data structure.
 A data structure that allows processing in compressed form.
 No need for index decompression as in other database and indexing systems.

Use of ParStream and its highly efficient query engine yields significant advantages over regular database systems. ParStream offers:
 faster index operations,
 shorter response times,
 substantially less CPU load,
 efficient parallel processing of search and analysis,
 the capability of adding new data to the index during query execution,
 close-to-real-time importing and querying of data.

Optimized Use of Resources
Inside one single processing node, as well as inside single data partitions, query processing can be parallelized to achieve minimal response time by using all available resources (CPU, I/O channels).

Reliability
The reliability of a ParStream database is guaranteed through several product features, and ParStream fully supports multiple-server environments. The ParStream database allows the usage of an arbitrary number of servers. Each server can be configured to replicate the entire data store or only parts of it.

Several load-balancing algorithms will pass a complete query to one of the servers, or parts of one query to several, even redundant, servers. The query can be sent to any of the cluster members, allowing customers to use a load-balancing and failover configuration according to their own needs, e.g. round-robin query distribution.

Interfaces
The ParStream server can be queried using any of the following three methods:

At a high level, ParStream provides a JDBC driver which enables the implementation of a cross-platform front end. This allows ParStream to be used with any application capable of using a standard JDBC interface. For example, a Java applet can be built that queries the ParStream server from within any web browser.

At mid-level, queries can be submitted as SQL code. Additionally, other descriptive query implementations are available, e.g. a JSON format. This allows the user to define queries which cannot easily be expressed in standard SQL. The results are then sent back to the client as CSV text or in binary format.

At a low level, ParStream's C++ API and base classes can be used to write user-defined query nodes that are stored in dynamic libraries. A developer can thus integrate his own query nodes into a tree description and register them dynamically with the ParStream server. Such user-defined queries can be executed
via a TCP/IP connection and are also integrated into ParStream's parallel execution framework. This interface layer allows the formulation of queries that cannot be expressed using SQL.

[Figure 4: ParStream interfaces – front-end applications and tools connect via JDBC/ODBC, socket, SOA and SQL 2003 interfaces or the C++ API to the real-time Big Data analytics engine, which is fed by parallel CSV and binary import with ETL support from MapReduce, RDBMS and raw-data sources]

Data Import
One of ParStream's strengths is its ability to import CSV files at unprecedented speeds. This is based on two factors: First, the index is much faster at adding data than the indexes used in most other databases. Second, the importer partitions and sorts the data in parallel, which exploits the capabilities of today's multi-core processors.

Additionally, the import process may run outside the query process, enabling the user to ship the finished data and index files to the servers. In this way, the import's CPU and I/O load can be separated and moved to different machines.

Another remarkable feature of ParStream is its ability to optionally operate on a CSV record store instead of column stores. This accelerates the import because only the indexes need to be written. Plus, since one usually wants to keep the original CSV files anyway, no additional hard drive space is wasted.
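The second import accelerator just described – partitioning the incoming data and sorting each partition in parallel – can be sketched in a few lines. The modulo partition function and the one-thread-per-partition scheme are assumptions made for the illustration, not ParStream's actual import logic:

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Split incoming records into partitions by key, then sort each
// partition on its own thread, so a multi-core machine builds
// all partitions concurrently.
std::vector<std::vector<int>>
partition_and_sort(const std::vector<int>& records, std::size_t nparts) {
    std::vector<std::vector<int>> parts(nparts);
    for (int r : records)  // partition step: a single sequential pass
        parts[static_cast<std::size_t>(r) % nparts].push_back(r);

    std::vector<std::thread> workers;  // one sorting thread per partition
    for (auto& p : parts)
        workers.emplace_back([&p] { std::sort(p.begin(), p.end()); });
    for (auto& w : workers)
        w.join();
    return parts;
}
```

In a real importer the partitions would subsequently be indexed and could be shipped to different servers, as the text describes.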
Supported Platforms
Currently, ParStream is available on a number of Linux distributions, including Red Hat Enterprise Linux, Novell Enterprise Linux and Debian Lenny, running on x86_64 CPUs. On request, ParStream will be ported to other platforms.
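The round-robin query distribution mentioned under Reliability can be illustrated with a minimal dispatcher: each incoming query is handed to the next server in a fixed rotation. The class, its interface and the server names are invented for this sketch:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Minimal round-robin load balancer: queries cycle through the
// configured servers in order, spreading load evenly.
class RoundRobinBalancer {
public:
    explicit RoundRobinBalancer(std::vector<std::string> servers)
        : servers_(std::move(servers)), next_(0) {}

    // Pick the server that should execute the next query.
    const std::string& pick() {
        const std::string& s = servers_[next_];
        next_ = (next_ + 1) % servers_.size();
        return s;
    }

private:
    std::vector<std::string> servers_;
    std::size_t next_;
};
```

A failover configuration would additionally skip servers that are currently unreachable, which is why replicating parts of the data store across servers matters.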
[Figure 5: ParStream addresses all industries and many applications – e.g. facetted search, web analytics, SEO analytics and online advertising in eCommerce; ad serving and targeting in social networks; customer attrition analysis, network monitoring and prepaid accounts in telcos; trend analysis, fraud detection, automatic trading and risk analysis in finance; smart metering, smart grids, wind parks and solar panels in energy; plus production, mining, M2M, sensors, genetics, intelligence and weather in many more sectors]

 Search and Selection – ParStream enables new levels of search and online shopping satisfaction. By providing facetted search in combination with continuously analyzing customer preferences and behavior in real-time, ParStream can guide customers very effectively to products and services of their interest, leading to increased conversion rates and platform revenue.

 Real-Time Analytics – ParStream drives interactive analytics processes to gain insights faster. Identifying valuable insights from Big Data is ideally an interactive process in which hypotheses are identified and validated. ParStream delivers results instantaneously on up-to-date Big Data, even if many power users or automated agents are querying the data at the same time. ParStream enables faster organizational learning for process and product optimization.

 Online Processing – ParStream responds automatically to large data streams. For online advertising, re-targeting, trend-spotting, fraud detection and many more cases, ParStream can automatically process new data, compare it to large historic data sets, and enable quick reactions. Fast-growing data volumes and increasing algorithmic complexity create a market for faster solutions.

[Figure 6: Standard data warehouse architecture vs. ParStream architecture – long query runtimes, frequent full table scans and nightly batch imports that leave data at least one day old, versus queries that each use multiple processor cores, query execution on HPCI compressed indices, and continuous parallel import that assures timeliness of data]

OFFERING
ParStream delivers a unique combination of real-time analytics, low-latency continuous import and high throughput for Big Data. ParStream features analysis of billions of records with thousands of columns in sub-second time with very small computing infrastructure requirements. ParStream does NOT use cubes to store data, which enables ParStream to analyze data in its full granularity, flexibly in all dimensions, and to import new data on the fly.

ParStream delivers results typically 1,000 times faster than PostgreSQL or MySQL. Compared to column-based databases, ParStream delivers superior performance; this performance differential with traditional technology increases exponentially with increasing data volume and query complexity.

ParStream has unique intellectual property in compression and indexing. With High-Performance-Compressed-Index (HPCI) technology, data indices can be analyzed in compressed form, removing the need to go through the extra step of decompression. This technological advantage results in major performance advantages compared to other solutions, including:
 substantially reduced CPU load (previously needed for decompression; index operations are more efficient than full data scanning),
 substantially reduced RAM volume (previously needed to store the decompressed index),
 much faster query response rates, because HPCI processing can be parallelized, eliminating the delays that result from decompression.

Compared to state-of-the-art data warehouse solutions, ParStream delivers results much faster, with more flexibility and granularity, on more up-to-date data, with lower infrastructure requirements and at lower TCO. ParStream uniquely addresses Big Data scenarios with its ability to continuously import data and make it available to analytics in real-time. This allows for analytics, filtering and conditioning of "live" data streams, for example from social network or financial data feeds.

ParStream licensing is based on data volume. The product can be deployed in four ways that can also be combined:
 as a traditional on-premise software deployment managed by the customer (including cloud deployments),
 as an appliance managed by the customer,
 as a cloud service provided by a ParStream service partner, and
 as a cloud service provided by ParStream.

These options provide flexibility and ease adoption, as customers can use ParStream on their preferred infrastructure or cloud, regardless of the number of servers, cores, etc. Although private and public cloud infrastructures are seen as the future model for many applications, dedicated cluster infrastructures are frequently the preferred option for real-time Big Data scenarios.

© 2012 by ParStream GmbH
Besides the detailed simulation of realistic processes, data analytics is one of the classic applications of High Performance Computing. What is new is that, with the amount of data increasing to the point that we now speak of Big Data Analytics, and the required time-to-result getting shorter and shorter, specialized solutions arise that apply the scale-out principle of HPC to areas that up to now have been in the focus of High Performance Computing. ParStream as a clustered database – optimized for highest read-only parallelized data throughput and next-to-real-time query results – is such a specialized solution.

transtec ensures that this latest innovative technology is implemented and run on the most reliable systems available, sized exactly according to the customer's specific requirements. ParStream as the software technology core, together with transtec HPC systems and transtec services combined, provides for the most reliable and comprehensive Big Data Analytics solutions available.

"As Big Data Analytics is becoming more and more important, new and innovative technology arises, and we are happy that with ParStream, our customers may experience a performance boost with respect to their analytics work by a factor of up to 1,000."

Matthias Groß
HPC Sales Specialist
GLOSSARY

ACML ("AMD Core Math Library")
A software development library released by AMD. This library provides useful mathematical routines optimized for AMD processors. Originally developed in 2002 for use in high-performance computing (HPC) and scientific computing, ACML allows nearly optimal use of AMD Opteron processors in compute-intensive applications. ACML consists of the following main components:
 A full implementation of Level 1, 2 and 3 Basic Linear Algebra Subprograms (→ BLAS), with optimizations for AMD Opteron processors.
 A full suite of Linear Algebra (→ LAPACK) routines.
 A comprehensive suite of Fast Fourier Transforms (FFTs) in single-, double-, single-complex and double-complex data types.
 Fast scalar, vector, and array math transcendental library routines.
 Random Number Generators in both single and double precision.
AMD offers pre-compiled binaries for Linux, Solaris, and Windows available for download. Supported compilers include gfortran, Intel Fortran Compiler, Microsoft Visual Studio, NAG, PathScale, PGI compiler, and Sun Studio.

BLAS ("Basic Linear Algebra Subprograms")
Routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS perform matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high-quality linear algebra software, e.g. → LAPACK. Although a model Fortran implementation of the BLAS is available from netlib in the BLAS library, it is not expected to perform as well as a specially tuned implementation on most high-performance computers – on some machines it may give much worse performance – but it allows users to run → LAPACK software on machines that do not offer any other implementation of the BLAS.

Cg ("C for Graphics")
A high-level shading language developed by Nvidia in close collaboration with Microsoft for programming vertex and pixel shaders. It is very similar to Microsoft's → HLSL. Cg is based on the C programming language, and although they share the same syntax, some features of C were modified and new data types were added to make Cg more suitable for programming graphics processing units. The language is only suitable for GPU programming and is not a general programming language. The Cg compiler outputs DirectX or OpenGL shader programs.

CISC ("complex instruction-set computer")
A computer instruction set architecture (ISA) in which each instruction can execute several low-level operations, such as a load from memory, an arithmetic operation, and a memory store, all in a single instruction. The term was retroactively coined in contrast to reduced instruction set computer (RISC). The terms RISC and CISC have become less meaningful with the continued evolution of both CISC and RISC designs and implementations, with modern processors also decoding and splitting more complex instructions into a series of smaller internal micro-operations that can thereby be executed in a pipelined fashion, thus achieving high performance on a much larger subset of instructions.

cluster
Aggregation of several, mostly identical or similar systems into a group working in parallel on a problem. Previously known as Beowulf clusters, HPC clusters are composed of commodity hardware and are scalable in design. The more machines are added to the cluster, the more performance can in principle be achieved.

control protocol
Part of the → parallel NFS standard.

CUDA driver API
Part of → CUDA.

CUDA SDK
Part of → CUDA.

CUDA toolkit
Part of → CUDA.

CUDA ("Compute Unified Device Architecture")
A parallel computing architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units (GPUs) that is accessible to software developers through industry-standard programming languages. Programmers use "C for CUDA" (C with NVIDIA extensions), compiled through a PathScale Open64 C compiler, to code algorithms for execution on the GPU. The CUDA architecture supports a range of computational interfaces including → OpenCL and → DirectCompute. Third-party wrappers are also available for Python, Fortran, Java and Matlab. CUDA works with all NVIDIA GPUs from the G8X series onwards, including GeForce, Quadro and the Tesla line. CUDA provides both a low-level API and a higher-level API. The initial CUDA SDK was made public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was later added in version 2.0, which supersedes the beta released February 14, 2008.

[Figure: CUDA memory hierarchy – each thread has per-thread private local memory; each thread block has per-block shared memory; grids of thread blocks within an application context share global memory]

CUDA is the hardware and software architecture that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, → OpenCL, → DirectCompute, and other languages. A CUDA program calls parallel kernels. A kernel executes in parallel across a set of parallel threads. The programmer or compiler organizes these threads in thread blocks and grids of thread blocks. The GPU instantiates a kernel program on a grid of parallel thread blocks. Each thread within a thread block executes an instance of the kernel, and has a thread ID within its thread block, program counter, registers, per-thread private memory, inputs, and output results.

A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchronization and shared memory. A thread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between dependent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in global memory space after kernel-wide global synchronization.

CUDA's hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp. While programmers can generally ignore warp execution for functional correctness and think of programming one thread, they can greatly improve performance by having threads in a warp execute the same code path and access memory in nearby addresses. See the main article "GPU Computing" for further details.

DirectCompute
An application programming interface (API) that supports general-purpose computing on graphics processing units (GPUs) on Microsoft Windows Vista or Windows 7. DirectCompute is part of the Microsoft DirectX collection of APIs and was initially released with the DirectX 11 API, but runs on both DirectX 10 and DirectX 11 GPUs. The DirectCompute architecture shares a range of computational interfaces with → OpenCL and → CUDA.

ETL ("Extract, Transform, Load")
A process in database usage and especially in data warehousing that involves:
 extracting data from outside sources,
 transforming it to fit operational needs (which can include quality levels),
 loading it into the end target (database or data warehouse).
The first part of an ETL process involves extracting the data from the source systems. In many cases this is the most challenging aspect of ETL, as extracting data correctly will set the stage for how subsequent processes will go. Most data warehousing projects consolidate data from different source systems. Each separate system may also use a different data organization/format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS) or other data structures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even fetching from outside sources such as through web spidering or screen-scraping. Streaming the extracted data source and loading it on-the-fly into the destination database is another way of performing ETL when no intermediate data storage is required. In general, the goal of the extraction phase is to convert the data into a single format which is appropriate for transformation processing. An intrinsic part of the extraction involves the parsing of extracted data, resulting in a check whether the data meets an expected pattern or structure. If not, the data may be rejected entirely or in part.
The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more transformation types may be required to meet the business and technical needs of the target database.
The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative information; updating extracted data is frequently done on a daily, weekly or monthly basis. Other DWs (or even other parts of the same DW) may add new data in a historicized form, for example, hourly. To understand this, consider a DW that is required to maintain sales records of the last year. This DW will overwrite any data that is older than a year with newer data. However, the entry of data for any one-year window will be made in a historicized manner. The timing and scope to replace or append are strategic design choices dependent on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW. As the load phase interacts with a database, the constraints defined in the database schema – as well as in triggers activated upon data load – apply (for example, uniqueness, referential integrity, mandatory fields), which also contribute to the overall data quality performance of the ETL process.

FFTW ("Fastest Fourier Transform in the West")
A software library for computing discrete Fourier transforms (DFTs), developed by Matteo Frigo and Steven G. Johnson at the Massachusetts Institute of Technology. FFTW is known as the fastest free software implementation of the Fast Fourier Transform (FFT) algorithm (upheld by regular benchmarks). It can compute transforms of real- and complex-valued arrays of arbitrary size and dimension in O(n log n) time.

floating point standard (IEEE 754)
The most widely used standard for floating-point computation, followed by many hardware (CPU and FPU) and software implementations. Many computer languages allow or require that some or all arithmetic be carried out using IEEE 754 formats and operations. The current version is IEEE 754-2008, which was published in August 2008; the original IEEE 754-1985 was published in 1985. The standard defines arithmetic formats, interchange formats, rounding algorithms, operations, and exception handling. The standard also includes extensive recommendations for advanced exception handling, additional operations (such as trigonometric functions), expression evaluation, and for achieving reproducible results. The standard defines single-precision, double-precision, as well as 128-bit quadruple-precision floating point numbers. In the proposed 754r version, the standard also defines the 2-byte half-precision number format.

FraunhoferFS (FhGFS)
A high-performance parallel file system from the Fraunhofer Competence Center for High Performance Computing. Built on scalable multithreaded core components with native → InfiniBand support, file system nodes can serve → InfiniBand and Ethernet (or any other TCP-enabled network) connections at the same time and automatically switch to a redundant connection path in case any of them fails. One of the most fundamental concepts of FhGFS is the strict avoidance of architectural bottlenecks. Striping file contents across multiple storage servers is only one part of this concept. Another important aspect is the distribution of file system metadata (e.g. directory information) across multiple metadata servers. Large systems and metadata-intensive applications in general can greatly profit from the latter feature.
FhGFS requires no dedicated file system partition on the servers – it uses existing partitions, formatted with any of the standard Linux file systems, e.g. XFS or ext4. For larger networks, it is also possible to create several distinct FhGFS file system partitions with different configurations. FhGFS provides a coherent mode, in which it is guaranteed that changes to a file or directory by one client are always immediately visible to other clients.

Global Arrays (GA)
A library developed by scientists at Pacific Northwest National Laboratory for parallel computing. GA provides a friendly API for shared-memory programming on distributed-memory computers for multidimensional arrays. The GA library is a predecessor to the GAS (global address space) languages currently being developed for high-performance computing. The GA toolkit has additional libraries including a Memory Allocator (MA), Aggregate Remote Memory Copy Interface (ARMCI), and functionality for out-of-core storage of arrays (ChemIO). Although GA was initially developed to run with TCGMSG, a message-passing library that came before the → MPI standard (Message Passing Interface), it is now fully compatible with → MPI. GA includes simple matrix computations (matrix-matrix multiplication, LU solve) and works with → ScaLAPACK. Sparse matrices are available, but the implementation is not optimal yet. GA was developed by Jarek Nieplocha, Robert Harrison and R. J. Littlefield. The ChemIO library for out-of-core storage was developed by Jarek Nieplocha, Robert Harrison and Ian Foster. The GA library is incorporated into many quantum chemistry packages, including NWChem, MOLPRO, UTChem, MOLCAS, and TURBOMOLE. The GA toolkit is free software, licensed under a self-made license.

Globus Toolkit
An open source toolkit for building computing grids, developed and provided by the Globus Alliance, currently at version 5.

GMP ("GNU Multiple Precision Arithmetic Library")
A free library for arbitrary-precision arithmetic, operating on signed integers, rational numbers, and floating point numbers. There are no practical limits to the precision except the ones implied by the available memory in the machine GMP runs on (the operand dimension limit is 2^31 bits on 32-bit machines and 2^37 bits on 64-bit machines). GMP has a rich set of functions, and the functions have a regular interface. The basic interface is for C, but wrappers exist for other languages including C++, C#, OCaml, Perl, PHP, and Python. In the past, the Kaffe Java virtual machine used GMP to support Java built-in arbitrary-precision arithmetic. This feature has been removed from recent releases, causing protests from people who claim that they used Kaffe solely for the speed benefits afforded by GMP. As a result, GMP support has been added to GNU Classpath. The main target applications of GMP are cryptography applications and research, Internet security applications, and computer algebra systems.
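The three ETL stages described in the glossary entry above can be sketched as a toy pipeline: extract parses a flat CSV source (rejecting records that do not match the expected pattern), transform applies a simple cleansing rule, and load appends cumulatively into a target table. The record layout, the uppercase rule, and the map used as "warehouse" are invented for the example:

```cpp
#include <cctype>
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

struct Row { std::string region; long revenue; };

// Extract: parse the flat-file source; records that do not match the
// expected "region,revenue" pattern are rejected, as the entry describes.
std::vector<Row> extract(const std::string& csv) {
    std::vector<Row> rows;
    std::istringstream in(csv);
    std::string line;
    while (std::getline(in, line)) {
        std::size_t comma = line.find(',');
        if (comma == std::string::npos) continue;  // reject malformed record
        rows.push_back({line.substr(0, comma),
                        std::stol(line.substr(comma + 1))});
    }
    return rows;
}

// Transform: apply a rule to fit the target's needs (here: normalize case).
std::vector<Row> transform(std::vector<Row> rows) {
    for (auto& r : rows)
        for (auto& c : r.region)
            c = static_cast<char>(std::toupper(static_cast<unsigned char>(c)));
    return rows;
}

// Load: cumulative load into the end target, overwriting with summed values.
void load(const std::vector<Row>& rows, std::map<std::string, long>& warehouse) {
    for (const auto& r : rows)
        warehouse[r.region] += r.revenue;
}
```

Real ETL tools add the scheduling, auditing and constraint handling discussed in the entry; the staging shown here is only the skeleton.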
GLOSSARY

GotoBLAS
Kazushige Goto's implementation of → BLAS.

grid (in CUDA architecture)
Part of the → CUDA programming model

GridFTP
An extension of the standard File Transfer Protocol (FTP) for use with Grid computing. It is defined as part of the → Globus toolkit, under the organisation of the Global Grid Forum (specifically, by the GridFTP working group). The aim of GridFTP is to provide a more reliable and high performance file transfer for Grid computing applications. This is necessary because of the increased demands of transmitting data in Grid computing – it is frequently necessary to transmit very large files, and this needs to be done fast and reliably. GridFTP is the answer to the problem of incompatibility between storage and access systems. Previously, each data provider would make their data available in their own specific way, providing a library of access functions. This made it difficult to obtain data from multiple sources, requiring a different access method for each, and thus dividing the total available data into partitions. GridFTP provides a uniform way of accessing the data, encompassing functions from all the different modes of access, building on and extending the universally accepted FTP standard. FTP was chosen as a basis for it because of its widespread use, and because it has a well defined architecture for extensions to the protocol (which may be dynamically discovered).

Hierarchical Data Format (HDF)
A set of file formats and libraries designed to store and organize large amounts of numerical data. Originally developed at the National Center for Supercomputing Applications, it is currently supported by the non-profit HDF Group, whose mission is to ensure continued development of HDF5 technologies, and the continued accessibility of data currently stored in HDF. In keeping with this goal, the HDF format, libraries and associated tools are available under a liberal, BSD-like license for general use. HDF is supported by many commercial and non-commercial software platforms, including Java, MATLAB, IDL, and Python. The freely available HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView). There currently exist two major versions of HDF, HDF4 and HDF5, which differ significantly in design and API.

HLSL ("High Level Shader Language")
The High Level Shader Language or High Level Shading Language (HLSL) is a proprietary shading language developed by Microsoft for use with the Microsoft Direct3D API. It is analogous to the GLSL shading language used with the OpenGL
standard. It is very similar to the NVIDIA Cg shading language, as it was developed alongside it.

HLSL programs come in three forms: vertex shaders, geometry shaders, and pixel (or fragment) shaders. A vertex shader is executed for each vertex that is submitted by the application, and is primarily responsible for transforming the vertex from object space to view space, generating texture coordinates, and calculating lighting coefficients such as the vertex's tangent, binormal and normal vectors. When a group of vertices (normally 3, to form a triangle) comes through the vertex shader, their output position is interpolated to form pixels within its area; this process is known as rasterisation. Each of these pixels comes through the pixel shader, whereby the resultant screen colour is calculated.

Optionally, an application using a Direct3D 10 interface and Direct3D 10 hardware may also specify a geometry shader. This shader takes as its input the three vertices of a triangle and uses this data to generate (or tessellate) additional triangles, which are each then sent to the rasterizer.

InfiniBand
Switched fabric communications link primarily used in HPC and enterprise data centers. Its features include high throughput, low latency, quality of service and failover, and it is designed to be scalable. The InfiniBand architecture specification defines a connection between processor nodes and high performance I/O nodes such as storage devices. Like → PCI Express, and many other modern interconnects, InfiniBand offers point-to-point bidirectional serial links intended for the connection of processors with high-speed peripherals such as disks. On top of the point to point capabilities, InfiniBand also offers multicast operations as well. It supports several signalling rates and, as with PCI Express, links can be bonded together for additional throughput.

The SDR serial connection's signalling rate is 2.5 gigabit per second (Gbit/s) in each direction per connection. DDR is 5 Gbit/s and QDR is 10 Gbit/s. FDR is 14.0625 Gbit/s and EDR is 25.78125 Gbit/s per lane. For SDR, DDR and QDR, links use 8B/10B encoding – every 10 bits sent carry 8 bits of data – making the useful data transmission rate four-fifths the raw rate. Thus single, double, and quad data rates carry 2, 4, or 8 Gbit/s useful data, respectively. For FDR and EDR, links use 64B/66B encoding – every 66 bits sent carry 64 bits of data.

Implementers can aggregate links in units of 4 or 12, called 4X or 12X. A 12X QDR link therefore carries 120 Gbit/s raw, or 96 Gbit/s of useful data. As of 2009 most systems use a 4X aggregate, implying a 10 Gbit/s (SDR), 20 Gbit/s (DDR) or 40 Gbit/s (QDR) connection. Larger systems with 12X links are typically used for cluster and supercomputer interconnects and for inter-switch connections.
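The relationship between raw signalling rate, line encoding, and useful data rate quoted above can be checked with a few lines of arithmetic (the function below is a sketch for illustration, not part of any InfiniBand software):

```python
# Useful InfiniBand data rate from raw signalling rate and line encoding:
# 8b/10b carries 8 payload bits per 10 transmitted, 64b/66b carries 64
# per 66. Lane aggregation (4X, 12X) scales the result linearly.
def useful_rate_gbps(raw_gbps, encoding, lanes=1):
    factor = {"8b/10b": 8 / 10, "64b/66b": 64 / 66}[encoding]
    return raw_gbps * factor * lanes

print(useful_rate_gbps(2.5, "8b/10b"))            # SDR lane: 2 Gbit/s
print(useful_rate_gbps(10, "8b/10b", lanes=12))   # 12X QDR: 96 Gbit/s

# FDR: 14.0625 Gbit/s raw with 64b/66b is about 13.64 Gbit/s per lane,
# commonly rounded to 14 Gbit/s in rate tables.
print(round(useful_rate_gbps(14.0625, "64b/66b"), 2))
```

The 12X QDR result reproduces the 120 Gbit/s raw versus 96 Gbit/s useful figure given in the entry.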
The single data rate switch chips have a latency of 200 nanoseconds, DDR switch chips have a latency of 140 nanoseconds and QDR switch chips have a latency of 100 nanoseconds. End-to-end MPI latencies range from about 1.07 microseconds to 2.6 microseconds, depending on the adapter. As of 2009 various InfiniBand host channel adapters (HCA) exist in the market, each with different latency and bandwidth characteristics. InfiniBand also provides RDMA capabilities for low CPU overhead. The latency for RDMA operations is less than 1 microsecond. See the main article "InfiniBand" for further description of InfiniBand features.

Intel Integrated Performance Primitives (Intel IPP)
A multi-threaded software library of functions for multimedia and data processing applications, produced by Intel. The library supports Intel and compatible processors and is available for Windows, Linux, and Mac OS X operating systems. It is available separately or as a part of Intel Parallel Studio. The library takes advantage of processor features including MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AES-NI and multicore processors. Intel IPP is divided into four major processing groups: Signal (with linear array or vector data), Image (with 2D arrays for typical color spaces), Matrix (with n x m arrays for matrix operations), and Cryptography. Half the entry points are of the matrix type, a third are of the signal type and the remainder are of the image and cryptography types. Intel IPP functions support several data types, including 8u (8-bit unsigned), 8s (8-bit signed), 16s, 32f (32-bit floating-point), 64f, etc. Typically, an application developer works with only one dominant data type for most processing functions, converting between input, processing, and output formats at the end points. Version 5.2 was introduced June 5, 2007, adding code samples for data compression, new video codec support, support for 64-bit applications on Mac OS X, support for Windows Vista, and new functions for ray-tracing and rendering. Version 6.1 was released with the Intel C++ Compiler on June 28, 2009, and Update 1 for version 6.1 was released on July 28, 2009.

Intel Threading Building Blocks (TBB)
A C++ template library developed by Intel Corporation for writing software programs that take advantage of multi-core processors. The library consists of data structures and algorithms that allow a programmer to avoid some complications arising from the use of native threading packages such as POSIX → threads, Windows → threads, or the portable Boost Threads, in which individual → threads of execution are created, synchronized, and terminated manually. Instead the library abstracts access to the multiple processors by allowing the operations to be treated as "tasks", which are allocated to individual cores dynamically by the library's run-time engine, and by automating efficient use of the CPU cache. A TBB program creates, synchronizes and destroys graphs of dependent tasks according to algorithms, i.e. high-level parallel programming paradigms (a.k.a. Algorithmic Skeletons). Tasks are then executed respecting graph dependencies. This approach groups TBB in a family of solutions for parallel programming aiming to decouple the programming from the particulars of the underlying machine. Intel TBB is available commercially as a binary distribution with support and in open source in both source and binary forms. Version 4.0 was introduced on September 8, 2011.

iSER ("iSCSI Extensions for RDMA")
A protocol that maps the iSCSI protocol over a network that provides RDMA services (like → iWARP or → InfiniBand). This permits data to be transferred directly into SCSI I/O buffers without intermediate data copies. The Datamover Architecture (DA) defines an abstract model in which the movement of data between iSCSI end nodes is logically separated from the rest of the iSCSI protocol. iSER is one Datamover protocol. The interface between the iSCSI layer and a Datamover protocol, iSER in this case, is called the Datamover Interface (DI).

iWARP ("Internet Wide Area RDMA Protocol")
An Internet Engineering Task Force (IETF) update of the RDMA Consortium's → RDMA over TCP standard. This later standard is zero-copy transmission over legacy TCP. Because a kernel implementation of the TCP stack is a tremendous bottleneck, a few vendors now implement TCP in hardware. This additional hardware is known as the TCP offload engine (TOE). TOE itself does not prevent copying on the receive side, and must be combined with RDMA hardware for zero-copy results. The main component is the Direct Data Placement (DDP) protocol, which permits the actual zero-copy transmission. The transmission itself is not performed by DDP, but by TCP.

kernel (in CUDA architecture)
Part of the → CUDA programming model

InfiniBand useful data rates per aggregated link:

          SDR        DDR        QDR        FDR        EDR
 1X    2 Gbit/s   4 Gbit/s   8 Gbit/s  14 Gbit/s   25 Gbit/s
 4X    8 Gbit/s  16 Gbit/s  32 Gbit/s  56 Gbit/s  100 Gbit/s
12X   24 Gbit/s  48 Gbit/s  96 Gbit/s 168 Gbit/s  300 Gbit/s

(SDR = Single Data Rate, DDR = Double Data Rate, QDR = Quadruple Data Rate, FDR = Fourteen Data Rate, EDR = Enhanced Data Rate)
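TBB itself is a C++ library, but its programming style – express work as independent tasks and let a runtime scheduler map them to cores, instead of creating and joining threads by hand – can be sketched with Python's standard thread pool as a stand-in scheduler. This is not the TBB API; the chunking and function names are invented for illustration:

```python
# Task-style parallelism in the spirit of TBB's model: express work as
# independent tasks and let a runtime assign them to workers. Python's
# ThreadPoolExecutor plays the role of the task scheduler here.
from concurrent.futures import ThreadPoolExecutor

def task(chunk):
    # One independent unit of work: a partial sum of squares.
    return sum(x * x for x in chunk)

data = list(range(10_000))
chunks = [data[i:i + 1_000] for i in range(0, len(data), 1_000)]

with ThreadPoolExecutor() as pool:          # runtime decides the mapping
    parallel = sum(pool.map(task, chunks))  # submit tasks, not threads

assert parallel == sum(x * x for x in data) # same result as serial code
print(parallel)
```

The point of the style, as in TBB, is that the program never creates, synchronizes, or terminates a thread itself; it only describes tasks.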
LAM/MPI
A high-quality open-source implementation of the → MPI specification, including all of MPI-1.2 and much of MPI-2. Superseded by the → OpenMPI implementation.

LAPACK ("linear algebra package")
Routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The original goal of the LAPACK project was to make the widely used EISPACK and → LINPACK libraries run efficiently on shared-memory vector and parallel processors. LAPACK routines are written so that as much as possible of the computation is performed by calls to the → BLAS library. While → LINPACK and EISPACK are based on the vector operation kernels of the Level 1 BLAS, LAPACK was designed at the outset to exploit the Level 3 BLAS. Highly efficient machine-specific implementations of the BLAS are available for many modern high-performance computers. The BLAS enable LAPACK routines to achieve high performance with portable software.

layout
Part of the → parallel NFS standard. Currently three types of layout exist: file-based, block/volume-based, and object-based, the latter making use of → object-based storage devices

LINPACK
A collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems. LINPACK was designed for supercomputers in use in the 1970s and early 1980s. LINPACK has been largely superseded by → LAPACK, which has been designed to run efficiently on shared-memory, vector supercomputers. LINPACK makes use of the → BLAS libraries for performing basic vector and matrix operations. The LINPACK benchmarks are a measure of a system's floating point computing power and measure how fast a computer solves a dense N by N system of linear equations Ax=b, which is a common task in engineering. The solution is obtained by Gaussian elimination with partial pivoting, with 2/3·N³ + 2·N² floating point operations. The result is reported in millions of floating point operations per second (MFLOP/s, sometimes simply called MFLOPS).

LNET
Communication protocol in → Lustre

logical object volume (LOV)
A logical entity in → Lustre

Lustre
An object-based → parallel file system
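The LINPACK operation count given above turns a measured runtime directly into an MFLOP/s figure. A sketch of that accounting (the timing value below is made up for illustration):

```python
# MFLOP/s rating of a LINPACK run, using the operation count quoted
# above: solving a dense N x N system by Gaussian elimination with
# partial pivoting costs 2/3*N^3 + 2*N^2 floating point operations.
def linpack_mflops(n, seconds):
    flops = (2.0 / 3.0) * n**3 + 2.0 * n**2
    return flops / seconds / 1e6

# Hypothetical example: N = 10000 solved in 70 s (invented timing).
print(round(linpack_mflops(10_000, 70.0)))
```

Note how the N³ term dominates: doubling N multiplies the work by roughly eight, which is why benchmark runs at large N measure sustained floating-point throughput rather than overheads.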
management server (MGS)
A functional component in → Lustre

MapReduce
A framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collectively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (unstructured) or in a database (structured).

"Map" step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node.

"Reduce" step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve.

MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in parallel – though in practice this is limited by the number of independent data sources and/or the number of CPUs near each source. Similarly, a set of 'reducers' can perform the reduction phase – provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient compared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than "commodity" servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the input data is still available.

metadata server (MDS)
A functional component in → Lustre

metadata target (MDT)
A logical entity in → Lustre

MKL ("Math Kernel Library")
A library of optimized math routines for science, engineering, and financial applications developed by Intel. Core math functions include → BLAS, → LAPACK, → ScaLAPACK, Sparse Solvers, Fast Fourier Transforms, and Vector Math. The library supports Intel and compatible processors and is available for Windows, Linux and Mac OS X operating systems.
MPI, MPI-2 ("message-passing interface")
A language-independent communications protocol used to program parallel computers. Both point-to-point and collective communication are supported. MPI remains the dominant model used in high-performance computing today. There are two versions of the standard that are currently popular: version 1.2 (shortly called MPI-1), which emphasizes message passing and has a static runtime environment, and MPI-2.1 (MPI-2), which includes new features such as parallel I/O, dynamic process management and remote memory operations. MPI-2 specifies over 500 functions and provides language bindings for ANSI C, ANSI Fortran (Fortran90), and ANSI C++. Interoperability of objects defined in MPI was also added to allow for easier mixed-language message passing programming. A side effect of MPI-2 standardization (completed in 1996) was clarification of the MPI-1 standard, creating the MPI-1.2 level. MPI-2 is mostly a superset of MPI-1, although some functions have been deprecated. Thus MPI-1.2 programs still work under MPI implementations compliant with the MPI-2 standard. The MPI Forum reconvened in 2007 to clarify some MPI-2 issues and explore developments for a possible MPI-3.

MPICH2
A freely available, portable → MPI 2.0 implementation, maintained by Argonne National Laboratory

MPP ("massively parallel processing")
So-called MPP jobs are computer programs with several parts running on several machines in parallel, often calculating simulation problems. The communication between these parts can e.g. be realized by the → MPI software interface.

MS-MPI
Microsoft → MPI 2.0 implementation shipped with Microsoft HPC Pack 2008 SDK, based on and designed for maximum compatibility with the → MPICH2 reference implementation.

MVAPICH2
An → MPI 2.0 implementation based on → MPICH2 and developed by the Department of Computer Science and Engineering at Ohio State University. It is available under BSD licensing and supports MPI over InfiniBand, 10GigE/iWARP and RDMAoE.

NetCDF ("Network Common Data Form")
A set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data. The project homepage is hosted by the Unidata program at the University Corporation for Atmospheric Research (UCAR). They are also the chief source of NetCDF software, standards development, updates, etc. The format is an open standard. NetCDF Classic and 64-bit Offset Format are an international standard
of the Open Geospatial Consortium. The project is actively supported by UCAR. The recently released (2008) version 4.0 greatly enhances the data model by allowing the use of the → HDF5 data file format. Version 4.1 (2010) adds support for C and Fortran client access to specified subsets of remote data via OPeNDAP. The format was originally based on the conceptual model of the NASA CDF but has since diverged and is not compatible with it. It is commonly used in climatology, meteorology and oceanography applications (e.g., weather forecasting, climate change) and GIS applications. It is an input/output format for many GIS applications, and for general scientific data exchange. The NetCDF C library, and the libraries based on it (Fortran 77 and Fortran 90, C++, and all third-party libraries) can, starting with version 4.1.1, read some data in other data formats. Data in the → HDF5 format can be read, with some restrictions. Data in the → HDF4 format can be read by the NetCDF C library if created using the → HDF4 Scientific Data (SD) API.

NetworkDirect
A remote direct memory access (RDMA)-based network interface implemented in Windows Server 2008 and later. NetworkDirect uses a more direct path from → MPI applications to networking hardware, resulting in very fast and efficient networking. See the main article "Windows HPC Server 2008 R2" for further details.

NFS (Network File System)
A network file system protocol originally developed by Sun Microsystems in 1984, allowing a user on a client computer to access files over a network in a manner similar to how local storage is accessed. NFS, like many other protocols, builds on the Open Network Computing Remote Procedure Call (ONC RPC) system. The Network File System is an open standard defined in RFCs, allowing anyone to implement the protocol.

Sun used version 1 only for in-house experimental purposes. When the development team added substantial changes to NFS version 1 and released it outside of Sun, they decided to release the new version as V2, so that version interoperation and RPC version fallback could be tested. Version 2 of the protocol (defined in RFC 1094, March 1989) originally operated entirely over UDP. Its designers meant to keep the protocol stateless, with locking (for example) implemented outside of the core protocol.

Version 3 (RFC 1813, June 1995) added:
• support for 64-bit file sizes and offsets, to handle files larger than 2 gigabytes (GB)
• support for asynchronous writes on the server, to improve write performance
• additional file attributes in many replies, to avoid the need to re-fetch them
• a READDIRPLUS operation, to get file handles and attributes along with file names when scanning a directory
• assorted other improvements

At the time of introduction of Version 3, vendor support for TCP as a transport-layer protocol began increasing. While several vendors had already added support for NFS Version 2 with TCP as a transport, Sun Microsystems added support for TCP as a transport for NFS at the same time it added support for Version 3. Using TCP as a transport made using NFS over a WAN more feasible.

Version 4 (RFC 3010, December 2000; revised in RFC 3530, April 2003), influenced by AFS and CIFS, includes performance improvements, mandates strong security, and introduces a stateful protocol. Version 4 became the first version developed with the Internet Engineering Task Force (IETF) after Sun Microsystems handed over the development of the NFS protocols.

NFS version 4 minor version 1 (NFSv4.1) has been approved by the IESG and received an RFC number in January 2010. The NFSv4.1 specification aims to provide protocol support to take advantage of clustered server deployments, including the ability to provide scalable parallel access to files distributed among multiple servers. NFSv4.1 adds the parallel NFS (pNFS) capability, which enables data access parallelism. The NFSv4.1 protocol defines a method of separating the filesystem meta-data from the location of the file data; it goes beyond the simple name/data separation by striping the data amongst a set of data servers. This is different from the traditional NFS server, which holds the names of files and their data under the single umbrella of the server.

In addition to pNFS, NFSv4.1 provides sessions, directory delegation and notifications, multi-server namespace, access control lists (ACL/SACL/DACL), retention attributions, and SECINFO_NO_NAME. See the main article "Parallel Filesystems" for further details.

Current work is being done in preparing a draft for a future version 4.2 of the NFS standard, including so-called federated filesystems, which constitute the NFS counterpart of Microsoft's distributed filesystem (DFS).

NUMA ("non-uniform memory access")
A computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.

object storage server (OSS)
A functional component in → Lustre
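The striping idea behind pNFS described above – file data spread round-robin over several data servers, while the metadata server only hands out the layout – can be sketched with a few lines of address arithmetic. The server names and the 1 MB stripe unit are invented for illustration:

```python
# Round-robin striping of a file over data servers, in the style of a
# pNFS file layout. Server names and stripe size are invented.
STRIPE_UNIT = 1024 * 1024                 # bytes per stripe unit
SERVERS = ["ds0", "ds1", "ds2", "ds3"]    # data servers from the layout

def locate(offset):
    """Map a byte offset to (data server, stripe unit index on it)."""
    unit = offset // STRIPE_UNIT
    return SERVERS[unit % len(SERVERS)], unit // len(SERVERS)

print(locate(0))                # ('ds0', 0)
print(locate(5 * STRIPE_UNIT))  # ('ds1', 1): units 0..3 fill ds0..ds3
```

Because consecutive stripe units land on different servers, a large sequential read fans out over all four servers at once – the data-access parallelism the entry refers to.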
object storage target (OST)
A logical entity in → Lustre

object-based storage device (OSD)
An intelligent evolution of disk drives that can store and serve objects rather than simply place data on tracks and sectors. This task is accomplished by moving low-level storage functions into the storage device and accessing the device through an object interface. Unlike a traditional block-oriented device providing access to data organized as an array of unrelated blocks, an object store allows access to data by means of storage objects. A storage object is a virtual entity that groups data together that has been determined by the user to be logically related. Space for a storage object is allocated internally by the OSD itself instead of by a host-based file system. OSDs manage all necessary low-level storage, space management, and security functions. Because there is no host-based metadata for an object (such as inode information), the only way for an application to retrieve an object is by using its object identifier (OID). The SCSI interface was modified and extended by the OSD Technical Work Group of the Storage Networking Industry Association (SNIA) with varied industry and academic contributors, resulting in a draft standard to T10 in 2004. This standard was ratified in September 2004 and became the ANSI T10 SCSI OSD V1 command set, released as INCITS 400-2004. The SNIA group continues to work on further extensions to the interface, such as the ANSI T10 SCSI OSD V2 command set.

OLAP cube ("Online Analytical Processing")
A set of data, organized in a way that facilitates non-predetermined queries for aggregated information, or in other words, online analytical processing. OLAP is one of the computer-based techniques for analyzing business data that are collectively called business intelligence. OLAP cubes can be thought of as extensions to the two-dimensional array of a spreadsheet. For example a company might wish to analyze some financial data by product, by time-period, by city, by type of revenue and cost, and by comparing actual data with a budget. These additional methods of analyzing the data are known as dimensions. Because there can be more than three dimensions in an OLAP system the term hypercube is sometimes used.

OpenCL ("Open Computing Language")
A framework for writing programs that execute across heterogeneous platforms consisting of CPUs, GPUs, and other processors. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism. OpenCL is analogous to the open industry standards OpenGL and OpenAL, for 3D graphics and computer audio, respectively. Originally developed by Apple Inc., which holds trademark rights, OpenCL is now managed by the non-profit technology consortium Khronos Group.
OpenMP ("Open Multi-Processing")
An application programming interface (API) that supports multi-platform shared memory multiprocessing programming in C, C++ and Fortran on many architectures, including Unix and Microsoft Windows platforms. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior.

Jointly defined by a group of major computer hardware and software vendors, OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.

An application built with the hybrid model of parallel programming can run on a computer cluster using both OpenMP and Message Passing Interface (MPI), or more transparently through the use of OpenMP extensions for non-shared memory systems.

OpenMPI
An open source → MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners.

parallel file system
A distributed filesystem like → Lustre or → parallel NFS, where a single storage namespace is spread over several storage devices across the network, and the data is accessed in a striped way over the storage access paths in order to increase performance. See the main article "Parallel Filesystems" for further details.

parallel NFS (pNFS)
A → parallel file system standard, optional part of the current → NFS standard 4.1. See the main article "Parallel Filesystems" for further details.

PCI Express (PCIe)
A computer expansion card standard designed to replace the older PCI, PCI-X, and AGP standards. Introduced by Intel in 2004, PCIe (or PCI-E, as it is commonly called) is the latest standard for expansion cards that is available on mainstream computers. PCIe, unlike previous PC expansion standards, is structured around point-to-point serial links, a pair of which (one in each direction) make up lanes, rather than a shared parallel bus. These lanes are routed by a hub on the mainboard acting as a crossbar switch. This dynamic point-to-point behavior allows more than one pair of devices to communicate with each other at the same time. In contrast, older PC interfaces had all devices permanently wired to the same bus; therefore, only one device could send information at a time. This format also allows "channel grouping", where multiple lanes are bonded to a single device pair in order to provide higher bandwidth. The number of lanes is "negotiated" during power-up or explicitly during operation. By making the lane count flexible, a single standard can provide for the needs of high-bandwidth cards (e.g. graphics cards, 10 Gigabit Ethernet cards and multiport Gigabit Ethernet cards) while also being economical for less demanding cards.

Unlike preceding PC expansion interface standards, PCIe is a network of point-to-point connections. This removes the need for "arbitrating" the bus or waiting for the bus to be free and allows for full duplex communications. This means that while standard PCI-X (133 MHz, 64 bit) and PCIe x4 have roughly the same data transfer rate, PCIe x4 will give better performance if multiple device pairs are communicating simultaneously or if communication within a single device pair is bidirectional. Specifications of the format are maintained and developed by a group of more than 900 industry-leading companies called the PCI-SIG (PCI Special Interest Group). In PCIe 1.x, each lane carries approximately 250 MB/s. PCIe 2.0, released in late 2007, adds a Gen2 signalling mode, doubling the rate to about 500 MB/s. On November 18, 2010, the PCI Special Interest Group officially published the finalized PCI Express 3.0 specification to its members to build devices based on this new version of PCI Express, which allows for a Gen3 signalling mode at 1 GB/s.

On November 29, 2011, the PCI-SIG announced that it will proceed to PCI Express 4.0 featuring 16 GT/s, still on copper technology. Additionally, active and idle power optimizations are to be investigated. Final specifications are expected to be released in 2014/15.

            PCIe 1.x   PCIe 2.x   PCIe 3.0   PCIe 4.0
  x1       256 MB/s   512 MB/s     1 GB/s     2 GB/s
  x2       512 MB/s     1 GB/s     2 GB/s     4 GB/s
  x4         1 GB/s     2 GB/s     4 GB/s     8 GB/s
  x8         2 GB/s     4 GB/s     8 GB/s    16 GB/s
  x16        4 GB/s     8 GB/s    16 GB/s    32 GB/s
  x32        8 GB/s    16 GB/s    32 GB/s    64 GB/s

PETSc ("Portable, Extensible Toolkit for Scientific Computation")
A suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. It employs the → Message Passing Interface (MPI) standard for all message-passing communication. The current version of PETSc is 3.2, released September 8, 2011. PETSc is intended for use in large-scale application projects; many ongoing computational science projects are built around the PETSc libraries. Its careful design allows advanced users to have detailed control over the solution process. PETSc includes a large suite of parallel linear and nonlinear equation solvers that are easily used in application codes written
in C, C++, Fortran and now Python. PETSc provides many of the mechanisms needed within parallel application code, such as simple parallel matrix and vector assembly routines that allow the overlap of communication and computation. In addition, PETSc includes support for parallel distributed arrays useful for finite difference methods.

process
→ thread

PTX ("parallel thread execution")
Parallel Thread Execution (PTX) is a pseudo-assembly language used in NVIDIA's CUDA programming environment. The 'nvcc' compiler translates code written in CUDA, a C-like language, into PTX, and the graphics driver contains a compiler which translates the PTX into something which can be run on the processing cores.

RDMA ("remote direct memory access")
Allows data to move directly from the memory of one computer into that of another without involving either one's operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clusters. RDMA relies on a special philosophy in using DMA. RDMA supports zero-copy networking by enabling the network adapter to transfer data directly to or from application memory, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. When an application performs an RDMA Read or Write request, the application data is delivered directly to the network, reducing latency and enabling fast message transfer. Common RDMA implementations include → InfiniBand, → iSER, and → iWARP.

RISC ("reduced instruction-set computer")
A CPU design strategy emphasizing the insight that simplified instructions that "do less" may still provide for higher performance if this simplicity can be utilized to make instructions execute very quickly. Opposite: → CISC

ScaLAPACK ("scalable LAPACK")
Library including a subset of → LAPACK routines redesigned for distributed memory MIMD (multiple instruction, multiple data) parallel computers. It is currently written in a Single-Program-Multiple-Data style using explicit message passing for interprocessor communication. ScaLAPACK is designed for heterogeneous computing and is portable on any computer that supports → MPI. The fundamental building blocks of the ScaLAPACK library are distributed memory versions (PBLAS) of the Level 1, 2 and 3 → BLAS, and a set of Basic
Linear Algebra Communication Subprograms (BLACS) for communication tasks that arise frequently in parallel linear algebra computations. In the ScaLAPACK routines, all interprocessor communication occurs within the PBLAS and the BLACS. One of the design goals of ScaLAPACK was to have the ScaLAPACK routines resemble their → LAPACK equivalents as much as possible.

service-oriented architecture (SOA)
An approach to building distributed, loosely coupled applications in which functions are separated into distinct services that can be distributed over a network, combined, and reused. See the main article "Windows HPC Server 2008 R2" for further details.

single precision/double precision
→ floating point standard

SMP ("shared memory processing")
So-called SMP jobs are computer programs with several parts running on the same system and accessing a shared memory region. A usual implementation of SMP jobs is → multi-threaded programs. The communication between the single threads can e.g. be realized by the → OpenMP software interface standard, but also in a non-standard way by means of native UNIX interprocess communication mechanisms.

SMP ("symmetric multiprocessing")
A multiprocessor or multicore computer architecture where two or more identical processors or cores can connect to a single shared main memory in a completely symmetric way, i.e. each part of the main memory has the same distance to each of the cores. Opposite: → NUMA

storage access protocol
Part of the → parallel NFS standard

STREAM
A simple synthetic benchmark program that measures sustainable memory bandwidth (in MB/s) and the corresponding computation rate for simple vector kernels.

streaming multiprocessor (SM)
Hardware component within the → Tesla GPU series

subnet manager
Application responsible for configuring the local → InfiniBand subnet and ensuring its continued operation.

superscalar processors
A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor. It thereby allows faster CPU throughput than would
  • GLOSSARYotherwise be possible at the same clock rate. A superscalar cess. On a single processor, multithreading generally occurs byprocessor executes more than one instruction during a clock multitasking: the processor switches between different threads.cycle by simultaneously dispatching multiple instructions to On a multiprocessor or multi-core system, the threads or tasksredundant functional units on the processor. Each functional will generally run at the same time, with each processor or coreunit is not a separate CPU core but an execution resource running a particular thread or task. Threads are distinguishedwithin a single CPU such as an arithmetic logic unit, a bit from processes in that processes are typically independent,shifter, or a multiplier. while threads exist as subsets of a process. Whereas processes have separate address spaces, threads share their addressTesla space, which makes inter-thread communication much easierNVIDIA‘s third brand of GPUs, based on high-end GPUs from than classical inter-process communication (IPC).the G80 and on. Tesla is NVIDIA‘s first dedicated GeneralPurpose GPU. Because of the very high computational power thread (in CUDA architecture)(measured in floating point operations per second or FLOPS) Part of the → CUDA programming modelcompared to recent microprocessors, the Tesla products areintended for the HPC market. The primary function of Tesla thread block (in CUDA architecture)products are to aid in simulations, large scale calculations Part of the → CUDA programming model(especially floating-point calculations), and image generationfor professional and scientific fields, with the use of → CUDA. thread processor array (TPA)See the main article “NVIDIA GPU Computing” for further Hardware component within the → Tesla GPU seriesdetails. 10 Gigabit Ethernetthread The fastest of the Ethernet standards, first published in 2002A thread of execution is a fork of a computer program into two as IEEE Std 802.3ae-2002. 
It defines a version of Ethernet withor more concurrently running tasks. The implementation of a nominal data rate of 10 Gbit/s, ten times as fast as Gigabitthreads and processes differs from one operating system to Ethernet. Over the years several 802.3 standards relating toanother, but in most cases, a thread is contained inside a pro- 10GbE have been published, which later were consolidated into 176
  • the IEEE 802.3-2005 standard. IEEE 802.3-2005 and the otheramendments have been consolidated into IEEE Std 802.3-2008.10 Gigabit Ethernet supports only full duplex links which canbe connected by switches. Half Duplex operation and CSMA/CD (carrier sense multiple access with collision detect) are notsupported in 10GbE. The 10 Gigabit Ethernet standard encom-passes a number of different physical layer (PHY) standards.As of 2008 10 Gigabit Ethernet is still an emerging technologywith only 1 million ports shipped in 2007, and it remains tobe seen which of the PHYs will gain widespread commercialacceptance.warp (in CUDA architecture)Part of the → CUDA programming modelOLAP cubeA set of data, organized in a way that facilitates non-predeter-mined queries for aggregated information, or in other words,online analytical processing. OLAP is one of the computer-basedtechniques for analyzing business data that are collectivelycalled business intelligence. OLAP cubes can be thought of asextensions to the two-dimensional array of a spreadsheet. Forexample a company might wish to analyze some financial databy product, by time-period, by city, by type of revenue and cost,and by comparing actual data with a budget. These additionalmethods of analyzing the data are known as dimensions.Because there can be more than three dimensions in an OLAPsystem the term hypercube is sometimes used. 177
transtec Germany: Tel +49 (0) 7071/703-400, transtec@transtec.de, www.transtec.de
transtec Switzerland: Tel +41 (0) 44/818 47 00, transtec.ch@transtec.ch, www.transtec.ch
transtec United Kingdom: Tel +44 (0) 1295/756 500, transtec.uk@transtec.co.uk, www.transtec.co.uk
ttec Netherlands: Tel +31 (0) 24 34 34 210, ttec@ttec.nl, www.ttec.nl
transtec France: Tel +33 (0), transtec.fr@transtec.fr, www.transtec.fr

Texts and conception: Dr. Oliver Tennert, Director Technology Management & HPC Solutions | Oliver.Tennert@transtec.de
Layout and design: Stefanie Gauger, Graphics & Design | Stefanie.Gauger@transtec.de

© transtec AG, June 2012

The graphics, diagrams and tables found herein are the intellectual property of transtec AG and may be reproduced or published only with its express permission. No responsibility will be assumed for inaccuracies or omissions. Other names or logos may be trademarks of their respective owners.