• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
HPC compass 2013/2014

HPC compass 2013/2014



Our new 186 pages HPC compass 2013/2014 is out!!

Our new 186 pages HPC compass 2013/2014 is out!!



Total Views
Views on SlideShare
Embed Views



1 Embed 29

http://www.scoop.it 29



Upload Details

Uploaded via as Adobe PDF

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
Post Comment
Edit your comment

    HPC compass 2013/2014 HPC compass 2013/2014 Document Transcript

    • Simulation CAD CAE Engineering Price Modelling Life Sciences Automotive Aerospace Risk Analysis Big Data Analytics High Throughput Computing HighPerformance Computing2013/14 TechnologyCompass
    • 2 TechnologyCompass Table of contents and Introduction High Performance Computing............................................4 Performance Turns Into Productivity.......................................................6 Flexible Deployment With xcat...................................................................8 Service and Customer Care From A to Z ..............................................10 Advanced Cluster Management Made Easy............ 12 Easy-to-use, Complete and Scalable.......................................................14 Cloud Bursting With Bright..........................................................................18 Intelligent HPC Workload Management.................... 30 Moab HPC Suite – Basic Edition.................................................................32 Moab HPC Suite – Enterprise Edition.....................................................34 Moab HPC Suite – Grid Option....................................................................36 Optimizing Accelerators with Moab HPC Suite...............................38 Moab HPC Suite – Application Portal Edition...................................42 Moab HPC Suite – Remote Visualization Edition............................44 Using Moab With Grid Environments....................................................46 Nice EnginFrame...................................................................... 52 A Technical Portal for Remote Visualization.....................................54 Application Highlights....................................................................................56 Desktop Cloud Virtualization.....................................................................58 Remote Visualization.......................................................................................60 Intel Cluster Ready ................................................................. 64 A Quality Standard for HPC Clusters......................................................66 Intel Cluster Ready builds HPC Momentum......................................70 The transtec Benchmarking Center........................................................74 Parallel NFS.................................................................................. 76 The New Standard for HPC Storage........................................................78 Whats´s New in NFS 4.1?...............................................................................80 Panasas HPC Storage.......................................................................................82 NVIDIA GPU Computing........................................................ 94 What is GPU Computing?..............................................................................96 Kepler GK110 – The Next Generation GPU..........................................98 Introducing NVIDIA Parallel Nsight...................................................... 100 A Quick Refresher on CUDA....................................................................... 112 Intel TrueScale InfiniBand and GPUs................................................... 118 Intel Xeon Phi Coprocessor..............................................122 The Architecture.............................................................................................. 124 InfiniBand....................................................................................134 High-speed Interconnects......................................................................... 136 Top 10 Reasons to Use Intel TrueScale InfiniBand...................... 138 Intel MPI Library 4.0 Performance......................................................... 140 Intel Fabric Suite 7.......................................................................................... 144 Parstream....................................................................................150 Big Data Analytics........................................................................................... 152 Glossary........................................................................................160
    • 3 More than 30 years of experience in Scientific Computing 1980 marked the beginning of a decade where numerous start- ups were created, some of which later transformed into big play- ers in the IT market. Technical innovations brought dramatic changes to the nascent computer market. In Tübingen, close to one of Germany’s prime and oldest universities, transtec was founded. In the early days, transtec focused on reselling DEC computers and peripherals, delivering high-performance workstations to university institutes and research facilities. In 1987, SUN/Sparc and storage solutions broadened the portfolio, enhanced by IBM/RS6000 products in 1991. These were the typical worksta- tions and server systems for high performance computing then, used by the majority of researchers worldwide. In the late 90s, transtec was one of the first companies to offer highly customized HPC cluster solutions based on standard Intel architecture servers, some of which entered the TOP 500 list of the world’s fastest computing systems. Thus, given this background and history, it is fair to say that transtec looks back upon a more than 30 years’ experience in sci- entific computing; our track record shows more than 500 HPC in- stallations. With this experience, we know exactly what custom- ers’ demands are and how to meet them. High performance and ease of management – this is what customers require today. HPC systems are for sure required to peak-perform, as their name in- dicates, but that is not enough: they must also be easy to handle. Unwieldy design and operational complexity must be avoided or at least hidden from administrators and particularly users of HPC computer systems. This brochure focusses on where transtec HPC solutions excel. transtec HPC solutions use the latest and most innovative tech- nology. Bright Cluster Manager as the technology leader for uni- fied HPC cluster management, leading-edge Moab HPC Suite for job and workload management, Intel Cluster Ready certification as an independent quality standard for our systems, Panasas HPC storage systems for highest performance and real ease of management required of a reliable HPC storage system. Again, with these components, usability, reliability and ease of man- agement are central issues that are addressed, even in a highly heterogeneous environment. transtec is able to provide custom- ers with well-designed, extremely powerful solutions for Tesla GPU computing, as well as thoroughly engineered Intel Xeon Phi systems. Intel’s InfiniBand Fabric Suite makes managing a large InfiniBand fabric easier than ever before – transtec mas- terly combines excellent and well-chosen components that are already there to a fine-tuned, customer-specific, and thoroughly designed HPC solution. Your decision for a transtec HPC solution means you opt for most intensive customer care and best service in HPC. Our ex- perts will be glad to bring in their expertise and support to assist you at any stage, from HPC design to daily cluster operations, to HPC Cloud Services. Last but not least, transtec HPC Cloud Services provide custom- ers with the possibility to have their jobs run on dynamically pro- vided nodes in a dedicated datacenter, professionally managed and individually customizable. Numerous standard applications like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like Gro- macs, NAMD, VMD, and others are pre-installed, integrated into an enterprise-ready cloud management environment, and ready to run. Have fun reading the transtec HPC Compass 2013/14!
    • HighPerformanceCom- puting–Performance TurnsIntoProductivity
    • 55 High Performance Comput- ing (HPC) has been with us from the very beginning of the computer era. High- performance computers were built to solve numer- ous problems which the “human computers” could not handle. The term HPC just hadn’t been coined yet. More important, some of the early principles have changed fundamentally.
    • 6 Performance Turns Into Productivity HPC systems in the early days were much different from those we see today. First, we saw enormous mainframes from large computer manufacturers, including a proprietary operating system and job management system. Second, at universities and research institutes, workstations made inroads and sci- entists carried out calculations on their dedicated Unix or VMS workstations. In either case, if you needed more comput- ing power, you scaled up, i.e. you bought a bigger machine. Today the term High-Performance Computing has gained a fundamentally new meaning. HPC is now perceived as a way to tackle complex mathematical, scientific or engineering problems. The integration of industry standard, “off-the-shelf” server hardware into HPC clusters facilitates the construction of computer networks of such power that one single system could never achieve. The new paradigm for parallelization is scaling out. Computer-supported simulations of realistic processes (so- called Computer Aided Engineering – CAE) has established itself as a third key pillar in the field of science and research alongside theory and experimentation. It is nowadays incon- ceivable that an aircraft manufacturer or a Formula One rac- ing team would operate without using simulation software. And scientific calculations, such as in the fields of astrophys- ics, medicine, pharmaceuticals and bio-informatics, will to a large extent be dependent on supercomputers in the future. Software manufacturers long ago recognized the benefit of high-performance computers based on powerful standard servers and ported their programs to them accordingly. The main advantages of scale-out supercomputers is just that: they are infinitely scalable, at least in principle. Since they are based on standard hardware components, such a supercom- puter can be charged with more power whenever the com- putational capacity of the system is not sufficient any more, simply by adding additional nodes of the same kind. A cum- bersome switch to a different technology can be avoided in most cases. HighPerformanceComputing Performance Turns Into Productivity “transtec HPC solutions are meant to provide cus- tomers with unparalleled ease-of-management and ease-of-use. Apart from that, deciding for a transtec HPC solution means deciding for the most intensive customer care and the best service imaginable.” Dr. Oliver Tennert Director Technology Management & HPC Solutions
    • 7 Variations on the theme: MPP and SMP Parallel computations exist in two major variants today. Ap- plications running in parallel on multiple compute nodes are frequently so-called Massively Parallel Processing (MPP) appli- cations. MPP indicates that the individual processes can each utilize exclusive memory areas. This means that such jobs are predestined to be computed in parallel, distributed across the nodes in a cluster. The individual processes can thus utilize the separate units of the respective node – especially the RAM, the CPU power and the disk I/O. Communication between the individual processes is imple- mented in a standardized way through the MPI software inter- face (Message Passing Interface), which abstracts the underlying network connections between the nodes from the processes. However, the MPI standard (current version 2.0) merely requires source code compatibility, not binary compatibility, so an off- the-shelf application usually needs specific versions of MPI libraries in order to run. Examples of MPI implementations are OpenMPI, MPICH2, MVAPICH2, Intel MPI or – for Windows clusters – MS-MPI. If the individual processes engage in a large amount of commu- nication, the response time of the network (latency) becomes im- portant. Latency in a Gigabit Ethernet or a 10GE network is typi- cally around 10 µs. High-speed interconnects such as InfiniBand, reduce latency by a factor of 10 down to as low as 1 µs. Therefore, high-speed interconnects can greatly speed up total processing. The other frequently used variant is called SMP applications. SMP, in this HPC context, stands for Shared Memory Processing. It involves the use of shared memory areas, the specific imple- mentation of which is dependent on the choice of the underly- ing operating system. Consequently, SMP jobs generally only run on a single node, where they can in turn be multi-threaded and thus be parallelized across the number of CPUs per node. For many HPC applications, both the MPP and SMP variant can be chosen. Many applications are not inherently suitable for parallel execu- tion. In such a case, there is no communication between the in- dividual compute nodes, and therefore no need for a high-speed network between them; nevertheless, multiple computing jobs can be run simultaneously and sequentially on each individual node, depending on the number of CPUs. In order to ensure optimum computing performance for these applications, it must be examined how many CPUs and cores de- liver the optimum performance. We find applications of this sequential type of work typically in the fields of data analysis or Monte-Carlo simulations.
    • 8 xCAT as a Powerful and Flexible Deployment Tool xCAT (Extreme Cluster Administration Tool) is an open source toolkit for the deployment and low-level administration of HPC cluster environments, small as well as large ones. xCAT provides simple commands for hardware control, node discovery, the collection of MAC addresses, and the node de- ployment with (diskful) or without local (diskless) installation. The cluster configuration is stored in a relational database. Node groups for different operating system images can be defined. Also, user-specific scripts can be executed automati- cally at installation time. xCAT Provides the Following Low-Level Administrative Fea- tures ʎʎ Remote console support ʎʎ Parallel remote shell and remote copy commands ʎʎ Plugins for various monitoring tools like Ganglia or Nagios ʎʎ Hardware control commands for node discovery, collect- ing MAC addresses, remote power switching and resetting of nodes ʎʎ Automatic configuration of syslog, remote shell, DNS, DHCP, and ntp within the cluster ʎʎ Extensive documentation and man pages For cluster monitoring, we install and configure the open source tool Ganglia or the even more powerful open source HighPerformanceComputing Flexible Deployment With xCAT
    • 9 solution Nagios, according to the customer’s preferences and requirements. Local Installation or Diskless Installation We offer a diskful or a diskless installation of the cluster nodes. A diskless installation means the operating system is hosted partially within the main memory, larger parts may or may not be included via NFS or other means. This approach allows for deploying large amounts of nodes very efficiently, and the cluster is up and running within a very small times- cale. Also, updating the cluster can be done in a very efficient way. For this, only the boot image has to be updated, and the nodes have to be rebooted. After this, the nodes run either a new kernel or even a new operating system. Moreover, with this approach, partitioning the cluster can also be very effi- ciently done, either for testing purposes, or for allocating dif- ferent cluster partitions for different users or applications. Development Tools, Middleware, and Applications According to the application, optimization strategy, or under- lying architecture, different compilers lead to code results of very different performance. Moreover, different, mainly com- mercial, applications, require different MPI implementations. And even when the code is self-developed, developers often prefer one MPI implementation over another. According to the customer’s wishes, we install various compil- ers, MPI middleware, as well as job management systems like Parastation, Grid Engine, Torque/Maui, or the very powerful Moab HPC Suite for the high-level cluster management.
    • 10 HighPerformanceComputing Services and Customer Care from A to Z transtec HPC as a Service You will get a range of applications like LS-Dyna, ANSYS, Gromacs, NAMD etc. from all kinds of areas pre-installed, integrated into an enterprise-ready cloud and workload management system, and ready-to run. Do you miss your application? Ask us: HPC@transtec.de transtec Platform as a Service You will be provided with dynamically provided compute nodes for running your individual code. The operating system will be pre-installed according to your require- ments. Common Linux distributions like RedHat, CentOS, or SLES are the standard. Do you need another distribu- tion? Ask us: HPC@transtec.de transtec Hosting as a Service You will be provided with hosting space inside a profes- sionally managed and secured datacenter where you can have your machines hosted, managed, maintained, according to your requirements. Thus, you can build up your own private cloud. What range of hosting and main- tenance services do you need? Tell us: HPC@transtec.de HPC @ transtec: Services and Customer Care from A to Z transtec AG has over 30 years of experience in scientific computing and is one of the earliest manufacturers of HPC clusters. For nearly a decade, transtec has delivered highly customized High Perfor- manceclustersbasedonstandardcomponentstoacademicandin- dustry customers across Europe with all the high quality standards and the customer-centric approach that transtec is well known for. Every transtec HPC solution is more than just a rack full of hard- ware – it is a comprehensive solution with everything the HPC user, owner,andoperatorneed.Intheearlystagesofanycustomer’sHPC project, transtec experts provide extensive and detailed consult- ing to the customer – they benefit from expertise and experience. Consulting is followed by benchmarking of different systems with either specifically crafted customer code or generally accepted individual Presales consulting application-, customer-, site-specific sizing of HPC solution burn-in tests of systems benchmarking of different systems continual improvement software & OS installation application installation onsite hardware assembly integration into customer’s environment customer training maintenance, support & managed services individual Presales consulting application-, customer-, site-specific sizing of HPC solution burn-in tests of systems benchmarking of different systems continual improvement software & OS installation application installation onsite hardware assembly integration into customer’s environment customer training maintenance, support & managed services Services and Customer Care from A to Z
    • 11 benchmarking routines; this aids customers in sizing and devising the optimal and detailed HPC configuration. Each and every piece of HPC hardware that leaves our factory un- dergoes a burn-in procedure of 24 hours or more if necessary. We make sure that any hardware shipped meets our and our custom- ers’ quality requirements. transtec HPC solutions are turnkey solu- tions.Bydefault,atranstecHPCclusterhaseverythinginstalledand configured – from hardware and operating system to important middleware components like cluster management or developer tools and the customer’s production applications. Onsite delivery means onsite integration into the customer’s production environ- ment, be it establishing network connectivity to the corporate net- work, or setting up software and configuration parts. transtec HPC clusters are ready-to-run systems – we deliver, you turn the key, the system delivers high performance. Every HPC proj- ect entails transfer to production: IT operation processes and poli- ciesapplytothenewHPCsystem.Effectively,ITpersonnelistrained hands-on, introduced to hardware components and software, with all operational aspects of configuration management. transtec services do not stop when the implementation projects ends. Beyond transfer to production, transtec takes care. transtec offers a variety of support and service options, tailored to the cus- tomer’s needs. When you are in need of a new installation, a major reconfiguration or an update of your solution – transtec is able to support your staff and, if you lack the resources for maintaining the cluster yourself, maintain the HPC solution for you. From Pro- fessional Services to Managed Services for daily operations and required service levels, transtec will be your complete HPC service and solution provider. transtec’s high standards of performance, re- liability and dependability assure your productivity and complete satisfaction. transtec’s offerings of HPC Managed Services offer customers the possibility of having the complete management and administra- tion of the HPC cluster managed by transtec service specialists, in an ITIL compliant way. Moreover, transtec’s HPC on Demand servic- es help provide access to HPC resources whenever they need them, for example, because they do not have the possibility of owning and running an HPC cluster themselves, due to lacking infrastruc- ture, know-how, or admin staff. transtec HPC Cloud Services Lastbutnotleasttranstec’sservicesportfolioevolvesascustomers‘ demands change. Starting this year, transtec is able to provide HPC Cloud Services. transtec uses a dedicated datacenter to provide computing power to customers who are in need of more capacity than they own, which is why this workflow model is sometimes called computing-on-demand. With these dynamically provided resources, customers with the possibility to have their jobs run on HPC nodes in a dedicated datacenter, professionally managed and secured, and individually customizable. Numerous standard ap- plications like ANSYS, LS-Dyna, OpenFOAM, as well as lots of codes like Gromacs, NAMD, VMD, and others are pre-installed, integrated intoanenterprise-readycloudandworkloadmanagementenviron- ment, and ready to run. Alternatively, whenever customers are in need of space for hosting their own HPC equipment because they do not have the space ca- pacity or cooling and power infrastructure themselves, transtec is also able to provide Hosting Services to those customers who’d like to have their equipment professionally hosted, maintained, and managed. Customers can thus build up their own private cloud! Are you interested in any of transtec’s broad range of HPC related services? Write us an email to HPC@transtec.de. We’ll be happy to hear from you!
    • AdvancedCluster Management MadeEasy
    • 13 Bright Cluster Manager re- moves the complexity from the installation, manage- ment and use of HPC clus- ters — onpremise or in the cloud. With Bright Cluster Man- ager, you can easily install, manage and use multiple clusters simultaneously, in- cluding compute, Hadoop, storage, database and work- station clusters.
    • 14 The Bright Advantage Bright Cluster Manager delivers improved productivity, in- creased uptime, proven scalability and security, while reduc- ing operating cost: Rapid Productivity Gains ʎʎ Short learning curve: intuitive GUI drives it all ʎʎ Quick installation: one hour from bare metal to compute- ready ʎʎ Fast, flexible provisioning: incremental, live, disk-full, disk-less, over InfiniBand, to virtual machines, auto node discovery ʎʎ Comprehensive monitoring: on-the-fly graphs, Rackview, mul- tiple clusters, custom metrics ʎʎ Powerful automation: thresholds, alerts, actions ʎʎ Complete GPU support: NVIDIA, AMD1, CUDA, OpenCL ʎʎ Full support for Intel Xeon Phi ʎʎ On-demand SMP: instant ScaleMP virtual SMP deployment ʎʎ Fast customization and task automation: powerful cluster management shell, SOAP and JSON APIs make it easy ʎʎ Seamless integration with leading workload managers: Slurm, Open Grid Scheduler, Torque, openlava, Maui2, PBS Profes- sional, Univa Grid Engine, Moab2, LSF ʎʎ Integrated (parallel) application development environment ʎʎ Easy maintenance: automatically update your cluster from Linux and Bright Computing repositories ʎʎ Easily extendable, web-based User Portal ʎʎ Cloud-readiness at no extra cost3, supporting scenarios “Clus- ter-on-Demand” and “Cluster-Extension”, with data-aware scheduling ʎʎ Deploys, provisions, monitors and manages Hadoop clusters ʎʎ Future-proof: transparent customization minimizes disrup- tion from staffing changes Maximum Uptime ʎʎ Automatic head node failover: prevents system downtime ʎʎ Powerful cluster automation: drives pre-emptive actions based on monitoring thresholds AdvancedClusterManagement MadeEasy Easy-to-use, complete and scalable The cluster installer takes you through the installation process and offers advanced options such as “Express” and “Remote”.
    • 15 ʎʎ Comprehensive cluster monitoring and health checking: au- tomatic sidelining of unhealthy nodes to prevent job failure Scalability from Deskside to TOP500 ʎʎ Off-loadable provisioning: enables maximum scalability ʎʎ Proven: used on some of the world’s largest clusters Minimum Overhead / Maximum Performance ʎʎ Lightweight: single daemon drives all functionality ʎʎ Optimized: daemon has minimal impact on operating system and applications ʎʎ Efficient: single database for all metric and configuration data Top Security ʎʎ Key-signed repositories: controlled, automated security and other updates ʎʎ Encryptionoption:forexternalandinternalcommunications ʎʎ Safe: X509v3 certificate-based public-key authentication ʎʎ Sensible access: role-based access control, complete audit trail ʎʎ Protected: firewalls and LDAP A Unified Approach Bright Cluster Manager was written from the ground up as a to- tally integrated and unified cluster management solution. This fundamental approach provides comprehensive cluster manage- ment that is easy to use and functionality-rich, yet has minimal impact on system performance. It has a single, light-weight daemon, a central database for all monitoring and configura- tion data, and a single CLI and GUI for all cluster management functionality. Bright Cluster Manager is extremely easy to use, scalable, secure and reliable. You can monitor and manage all aspects of your clusters with virtually no learning curve. Bright’s approach is in sharp contrast with other cluster man- agement offerings, all of which take a “toolkit” approach. These toolkits combine a Linux distribution with many third-party tools for provisioning, monitoring, alerting, etc. This approach has critical limitations: these separate tools were not designed to work together; were often not designed for HPC, nor designed to scale. Furthermore, each of the tools has its own interface (mostly command-line based), and each has its own daemon(s) and database(s). Countless hours of scripting and testing by highly skilled people are required to get the tools to work for a specific cluster, and much of it goes undocumented. Time is wasted, and the cluster is at risk if staff changes occur, losing the “in-head” knowledge of the custom scripts. By selecting a cluster node in the tree on the left and the Tasks tab on the right, you can execute a number of powerful tasks on that node with just a single mouse click.
    • 1616 Ease of Installation Bright Cluster Manager is easy to install. Installation and testing of a fully functional cluster from “bare metal” can be complet- ed in less than an hour. Configuration choices made during the installation can be modified afterwards. Multiple installation modes are available, including unattended and remote modes. Cluster nodes can be automatically identified based on switch ports rather than MAC addresses, improving speed and reliabil- ity of installation, as well as subsequent maintenance. All major hardware brands are supported: Dell, Cray, Cisco, DDN, IBM, HP, Supermicro, Acer, Asus and more. Ease of Use Bright Cluster Manager is easy to use, with two interface op- tions: the intuitive Cluster Management Graphical User Interface (CMGUI) and the powerful Cluster Management Shell (CMSH). The CMGUI is a standalone desktop application that provides a single system view for managing all hardware and software aspects of the cluster through a single point of control. Admin- istrative functions are streamlined as all tasks are performed through one intuitive, visual interface. Multiple clusters can be managed simultaneously. The CMGUI runs on Linux, Windows and OS X, and can be extended using plugins. The CMSH provides practically the same functionality as the CMGUI, but via a com- mand-line interface. The CMSH can be used both interactively and in batch mode via scripts. Either way, you now have unprecedented flexibility and control over your clusters. Support for Linux and Windows Bright Cluster Manager is based on Linux and is available with a choice of pre-integrated, pre-configured and optimized Linux distributions, including SUSE Linux Enterprise Server, Red Hat Enterprise Linux, CentOS and Scientific Linux. Dual-boot installa- tions with Windows HPC Server are supported as well, allowing nodes to either boot from the Bright-managed Linux head node, or the Windows-managed head node. AdvancedClusterManagement MadeEasy Easy-to-use, complete and scalable TheOverviewtabprovidesinstant,high-levelinsightintothestatusof the cluster.
    • 1717 Extensive Development Environment Bright Cluster Manager provides an extensive HPC development environment for both serial and parallel applications, including the following (some are cost options): ʎʎ Compilers, including full suites from GNU, Intel, AMD and Port- land Group. ʎʎ Debuggers and profilers, including the GNU debugger and profiler, TAU, TotalView, Allinea DDT and Allinea MAP. ʎʎ GPU libraries, including CUDA and OpenCL. ʎʎ MPI libraries, including OpenMPI, MPICH, MPICH2, MPICHMX, MPICH2-MX, MVAPICH and MVAPICH2; all cross-compiled with the compilers installed on Bright Cluster Manager, and opti- mized for high-speed interconnects such as InfiniBand and 10GE. ʎʎ Mathematical libraries, including ACML, FFTW, Goto-BLAS, MKL and ScaLAPACK. ʎʎ Other libraries, including Global Arrays, HDF5, IIPP, TBB, Net- CDF and PETSc. Bright Cluster Manager also provides Environment Modules to make it easy to maintain multiple versions of compilers, libraries andapplications fordifferentusers on the cluster, without creating compatibility conflicts. Each Environment Module file contains the information needed to configure the shell for an application, and automatically sets these variables correctly for the particular appli- cationwhenitisloaded.BrightClusterManagerincludesmanypre- configured module files for many scenarios, such as combinations of compliers, mathematical and MPI libraries. Powerful Image Management and Provisioning Bright Cluster Manager features sophisticated software image management and provisioning capability. A virtually unlimited number of images can be created and assigned to as many dif- ferent categories of nodes as required. Default or custom Linux kernels can be assigned to individual images. Incremental changes to images can be deployed to live nodes without re- booting or re-installation. The provisioning system only propagates changes to the images, minimizing time and impact on system performance and avail- ability. Provisioning capability can be assigned to any number of nodes on-the-fly, for maximum flexibility and scalability. Bright Cluster Manager can also provision over InfiniBand and to ram- disk or virtual machine. Comprehensive Monitoring With Bright Cluster Manager, you can collect, monitor, visual- ize and analyze a comprehensive set of metrics. Many software and hardware metrics available to the Linux kernel, and many hardware management interface metrics (IPMI, DRAC, iLO, etc.) are sampled. Cluster metrics, such as GPU, Xeon Phi and CPU temperatures, fan speeds and network statistics can be visualized by simply dragging and dropping them into a graphing window. Multiple metrics can be combined in one graph and graphs can be zoomed into. A Graphing wizard allows creation of all graphs for a selected combination of metrics and nodes. Graph layout and color configurations can be tailored to your requirements and stored for re-use.
    • 18 High performance meets efficiency Initially, massively parallel systems constitute a challenge to both administrators and users. They are complex beasts. Anyone building HPC clusters will need to tame the beast, master the complexity and present users and administrators with an easy- to-use, easy-to-manage system landscape. Leading HPC solution providers such as transtec achieve this goal. They hide the complexity of HPC under the hood and match high performance with efficiency and ease-of-use for both users and administrators. The “P” in “HPC” gains a double meaning: “Performance” plus “Productivity”. Cluster and workload management software like Moab HPC Suite, Bright Cluster Manager or QLogic IFS provide the means to master and hide the inherent complexity of HPC systems. For administrators and users, HPC clusters are presented as single, large machines, with many different tuning parameters. The software also provides a unified view of existing clusters whenever unified management is added as a requirement by the customer at any point in time after the first installation. Thus, daily routine tasks such as job management, user management, queue partitioning and management, can be performed easily with either graphical or web-based tools, without any advanced scripting skills or technical expertise required from the adminis- trator or user. AdvancedClusterManagement MadeEasy Cloud Bursting With Bright
    • 19 Bright Cluster Manager unleashes and manages the unlimited power of the cloud Create a complete cluster in Amazon EC2, or easily extend your onsite cluster into the cloud, enabling you to dynamically add capacity and manage these nodes as part of your onsite cluster. Both can be achieved in just a few mouse clicks, without the need for expert knowledge of Linux or cloud computing. Bright Cluster Manager’s unique data aware scheduling capabil- ity means that your data is automatically in place in EC2 at the start of the job, and that the output data is returned as soon as the job is completed. With Bright Cluster Manager, every cluster is cloud-ready, at no extra cost. The same powerful cluster provisioning, monitoring, scheduling and management capabilities that Bright Cluster Manager provides to onsite clusters extend into the cloud. The Bright advantage for cloud bursting ʎʎ Ease of use: Intuitive GUI virtually eliminates user learning curve; no need to understand Linux or EC2 to manage system. Alternatively, cluster management shell provides powerful scripting capabilities to automate tasks ʎʎ Complete management solution: Installation/initialization, provisioning, monitoring, scheduling and management in one integrated environment ʎʎ Integrated workload management: Wide selection of work- load managers included and automatically configured with local, cloud and mixed queues ʎʎ Singlesystemview;completevisibilityandcontrol:Cloudcom- pute nodes managed as elements of the on-site cluster; visible from a single console with drill-downs and historic data. ʎʎ Efficient data management via data aware scheduling: Auto- matically ensures data is in position at start of computation; delivers results back when complete. ʎʎ Secure, automatic gateway: Mirrored LDAP and DNS services over automatically-created VPN connects local and cloud- based nodes for secure communication. ʎʎ Cost savings: More efficient use of cloud resources; support for spot instances, minimal user intervention. Two cloud bursting scenarios Bright Cluster Manager supports two cloud bursting scenarios: “Cluster-on-Demand” and “Cluster Extension”. Scenario 1: Cluster-on-Demand The Cluster-on-Demand scenario is ideal if you do not have a cluster onsite, or need to set up a totally separate cluster. With just a few mouse clicks you can instantly create a complete clus- ter in the public cloud, for any duration of time. Scenario 2: Cluster Extension TheClusterExtensionscenarioisidealifyouhaveaclusteronsite but you need more compute power, including GPUs. With Bright Cluster Manager, you can instantly add EC2-based resources to
    • 20 your onsite cluster, for any duration of time. Bursting into the cloud is as easy as adding nodes to an onsite cluster — there are only a few additional steps after providing the public cloud ac- count information to Bright Cluster Manager. The Bright approach to managing and monitoring a cluster in the cloud provides complete uniformity, as cloud nodes are man- aged and monitored the same way as local nodes: ʎʎ Load-balanced provisioning ʎʎ Software image management ʎʎ Integrated workload management ʎʎ Interfaces — GUI, shell, user portal ʎʎ Monitoring and health checking ʎʎ Compilers, debuggers, MPI libraries, mathematical libraries and environment modules Bright Cluster Manager also provides additional features that are unique to the cloud. AdvancedClusterManagement MadeEasy Easy-to-use, complete and scalable The Add Cloud Provider Wizard and the Node Creation Wizard make the cloud bursting process easy, also for users with no cloud experience.
    • 21 Amazon spot instance support Bright Cluster Manager enables users to take advantage of the cost savings offered by Amazon’s spot instances. Users can spec- ify the use of spot instances, and Bright will automatically sched- ule as available, reducing the cost to compute without the need to monitor spot prices and manually schedule. Hardware Virtual Machine (HVM) virtualization Bright Cluster Manager automatically initializes all Amazon instance types, including Cluster Compute and Cluster GPU in- stances that rely on HVM virtualization. Data aware scheduling Data aware scheduling ensures that input data is transferred to the cloud and made accessible just prior to the job starting, and that the output data is transferred back. There is no need to wait (and monitor) for the data to load prior to submitting jobs (de- laying entry into the job queue), nor any risk of starting the job before the data transfer is complete (crashing the job). Users sub- mit their jobs, and Bright’s data aware scheduling does the rest. Bright Cluster Manager provides the choice: “To Cloud, or Not to Cloud” Not all workloads are suitable for the cloud. The ideal situation for most organizations is to have the ability to choose between onsite clusters for jobs that require low latency communication, complex I/O or sensitive data; and cloud clusters for many other types of jobs. Bright Cluster Manager delivers the best of both worlds: a pow- erful management solution for local and cloud clusters, with the ability to easily extend local clusters into the cloud without compromising provisioning, monitoring or managing the cloud resources. One or more cloud nodes can be configured under the Cloud Settings tab. Bright Computing
    • 22 Examples include CPU, GPU and Xeon Phi temperatures, fan speeds, switches, hard disk SMART information, system load, memory utilization, network metrics, storage metrics, power systems statistics, and workload management metrics. Custom metrics can also easily be defined. Metric sampling is done very efficiently — in one process, or out- of-band where possible. You have full flexibility over how and when metrics are sampled, and historic data can be consolidat- ed over time to save disk space. Cluster Management Automation Cluster management automation takes pre-emptive actions when predetermined system thresholds are exceeded, saving time and preventing hardware damage. Thresholds can be con- figured on any of the available metrics. The built-in configura- tion wizard guides you through the steps of defining a rule: selecting metrics, defining thresholds and specifying actions. For example, a temperature threshold for GPUs can be estab- lished that results in the system automatically shutting down an AdvancedClusterManagement MadeEasy Easy-to-use, complete and scalable The parallel shell allows for simultaneous execution of commands or scripts across node groups or across the entire cluster. The status of cluster nodes, switches, other hardware, as well as up to six metrics can be visualized in the Rackview. A zoomout option is available for clusters with many racks.
    • 23 overheated GPU unit and sending a text message to your mobile phone. Several predefined actions are available, but any built-in cluster management command, Linux command or script can be used as an action. Comprehensive GPU Management Bright Cluster Manager radically reduces the time and effort of managing GPUs, and fully integrates these devices into the single view of the overall system. Bright includes powerful GPU management and monitoring capability that leverages function- ality in NVIDIA Tesla and AMD GPUs. You can easily assume maximum control of the GPUs and gain instant and time-based status insight. Depending on the GPU make and model, Bright monitors a full range of GPU metrics, including: ʎʎ GPU temperature, fan speed, utilization ʎʎ GPU exclusivity, compute, display, persistance mode ʎʎ GPU memory utilization, ECC statistics ʎʎ Unit fan speed, serial number, temperature, power usage, vol- tages and currents, LED status, firmware. ʎʎ Board serial, driver version, PCI info. Beyond metrics, Bright Cluster Manager features built-in support for GPU computing with CUDA and OpenCL libraries. Switching between current and previous versions of CUDA and OpenCL has also been made easy. Full Support for Intel Xeon Phi Bright Cluster Manager makes it easy to set up and use the Intel Xeon Phi coprocessor. Bright includes everything that is needed to get Phi to work, including a setup wizard in the CMGUI. Bright ensures that your software environment is set up correctly, so that the Intel Xeon Phi coprocessor is available for applications that are able to take advantage of it. Bright collects and displays a wide range of metrics for Phi, en- suring that the coprocessor is visible and manageable as a de- vice type, as well as including Phi as a resource in the workload management system. Bright’s pre-job health checking ensures that Phi is functioning properly before directing tasks to the coprocessor. Multi-Tasking Via Parallel Shell The parallel shell allows simultaneous execution of multiple commands and scripts across the cluster as a whole, or across easily definable groups of nodes. Output from the executed commands is displayed in a convenient way with variable levels of verbosity. Running commands and scripts can be killed eas- ily if necessary. The parallel shell is available through both the CMGUI and the CMSH. Integrated Workload Management Bright Cluster Manager is integrated with a wide selection of free and commercial workload managers. This integration provides a number of benefits: ʎʎ The selected workload manager gets automatically installed and configured The automation configuration wizard guides you through the steps of defining a rule: selecting metrics, defining thresholds and specifying actions
    • 24 ʎʎ Many workload manager metrics are monitored ʎʎ The CMGUI and User Portal provide a user-friendly interface to the workload manager ʎʎ The CMSH and the SOAP & JSON APIs provide direct and po- werful access to a number of workload manager commands and metrics ʎʎ Reliable workload manager failover is properly configured ʎʎ The workload manager is continuously made aware of the health state of nodes (see section on Health Checking) ʎʎ The workload manager is used to save power through auto- power on/off based on workload ʎʎ The workload manager is used for data-aware scheduling of jobs to the cloud The following user-selectable workload managers are tightly in- tegrated with Bright Cluster Manager: ʎʎ PBS Professional, Univa Grid Engine, Moab, LSF ʎʎ Slurm, openlava, Open Grid Scheduler, Torque, Maui Alternatively, other workload managers, such as LoadLevele and Condor can be installed on top of Bright Cluster Manager. Integrated SMP Support Bright Cluster Manager — Advanced Edition dynamically aggre- gates multiple cluster nodes into a single virtual SMP node, us- ing ScaleMP’s Versatile SMP™ (vSMP) architecture. Creating and dismantling a virtual SMP node can be achieved with just a few AdvancedClusterManagement MadeEasy Easy-to-use, complete and scalable Creating and dismantling a virtual SMP node can be achieved with just a few clicks within the GUI or a single command in the cluster management shell. Example graphs that visualize metrics on a GPU cluster.
    • 25 clicks within the CMGUI. Virtual SMP nodes can also be launched and dismantled automatically using the scripting capabilities of the CMSH. In Bright Cluster Manager a virtual SMP node behaves like any other node, enabling transparent, on-the-fly provisioning, con- figuration, monitoring and management of virtual SMP nodes as part of the overall system management. Maximum Uptime with Head Node Failover Bright Cluster Manager — Advanced Edition allows two head nodes to be configured in active-active failover mode. Both head nodes are on active duty, but if one fails, the other takes over all tasks, seamlesly. Maximum Uptime with Health Checking Bright Cluster Manager — Advanced Edition includes a powerful cluster health checking framework that maxi- mizes system uptime. It continually checks multiple health indicators for all hardware and software components and proactively initiates corrective actions. It can also automati- cally perform a series of standard and user-defined tests just before starting a new job, to ensure a successful execution, and preventing the “black hole node syndrome”. Examples of corrective actions include autonomous bypass of faulty nodes, automatic job requeuing to avoid queue flushing, and process “jailing” to allocate, track, trace and flush completed user processes. The health checking framework ensures the highest job throughput, the best overall cluster efficiency and the lowest administration overhead. Top Cluster Security Bright Cluster Manager offers an unprecedented level of security that can easily be tailored to local requirements. Security features include: ʎʎ Automated security and other updates from key-signed Linux and Bright Computing repositories. ʎʎ Encrypted internal and external communications. ʎʎ X509v3 certificate based public-key authentication to the cluster management infrastructure. ʎʎ Role-based access control and complete audit trail. ʎʎ Firewalls, LDAP and SSH. User and Group Management Users can be added to the cluster through the CMGUI or the CMSH. Bright Cluster Manager comes with a pre-configured LDAP database, but an external LDAP service, or alternative au- thentication system, can be used instead. Web-Based User Portal The web-based User Portal provides read-only access to essen- tial cluster information, including a general overview of the clus- ter status, node hardware and software properties, workload manager statistics and user-customizable graphs. The User Por- tal can easily be customized and expanded using PHP and the SOAP or JSON APIs. Workload management queues can be viewed and configured from the GUI, without the need for workload management expertise.
    • 26 Multi-Cluster Capability Bright Cluster Manager — Advanced Edition is ideal for organiza- tions that need to manage multiple clusters, either in one or in multiple locations. Capabilities include: ʎʎ All cluster management and monitoring functionality is available for all clusters through one GUI. ʎʎ Selecting any set of configurations in one cluster and ex- porting them to any or all other clusters with a few mouse clicks. ʎʎ Metric visualizations and summaries across clusters. ʎʎ Making node images available to other clusters. Fundamentally API-Based Bright Cluster Manager is fundamentally API-based, which means that any cluster management command and any piece of cluster management data — whether it is monitoring data or configuration data — is available through the API. Both a SOAP and a JSON API are available and interfaces for various programming languages, including C++, Python and PHP are provided. Cloud Bursting Bright Cluster Manager supports two cloud bursting sce- narios: “Cluster-on-Demand” — running stand-alone clusters AdvancedClusterManagement MadeEasy Easy-to-use, complete and scalable “The building blocks for transtec HPC solutions must be chosen according to our goals ease-of- management and ease-of-use. With Bright Cluster Manager, we are happy to have the technology leader at hand, meeting these requirements, and our customers value that.” Armin Jäger HPC Solution Engineer
    • 27 in the cloud; and “Cluster Extension” — adding cloud-based resources to existing, onsite clusters and managing these cloud nodes as if they were local. In addition, Bright provides data aware scheduling to ensure that data is accessible in the cloud at the start of jobs, and results are promptly trans- ferred back. Both scenarios can be achieved in a few simple steps. Every Bright cluster is automatically cloud-ready, at no extra cost. Hadoop Cluster Management Bright Cluster Manager is the ideal basis for Hadoop clusters. Bright installs on bare metal, configuring a fully operational Hadoop cluster in less than one hour. In the process, Bright pre- pares your Hadoop cluster for use by provisioning the operating system and the general cluster management and monitoring ca- pabilities required as on any cluster. Bright then manages and monitors your Hadoop cluster’s hardware and system software throughout its life-cycle, collecting and graphically displaying a full range of Hadoop metrics from the HDFS, RPC and JVM sub-systems. Bright sig- nificantly reduces setup time for Cloudera, Hortonworks and other Hadoop stacks, and increases both uptime and Map- Reduce job throughput. This functionality is scheduled to be further enhanced in upcoming releases of Bright, including dedicated manage- ment roles and profiles for name nodes, data nodes, as well as advanced Hadoop health checking and monitoring func- tionality. Standard and Advanced Editions Bright Cluster Manager is available in two editions: Standard and Advanced. The table on this page lists the differences. You can easily upgrade from the Standard to the Advanced Edition as your cluster grows in size or complexity. The web-based User Portal provides read-only access to essential cluster in- formation, including a general overview of the cluster status, node hardware and software properties, workload manager statistics and user-customizable graphs. Bright Cluster Manager can manage multiple clusters simultaneously. This overview shows clusters in Oslo, Abu Dhabi and Houston, all managed through one GUI.
    • 28 Documentation and Services A comprehensive system administrator manual and user man- ual are included in PDF format. Standard and tailored services are available, including various levels of support, installation, training and consultancy. AdvancedClusterManagement MadeEasy Easy-to-use, complete and scalable Feature Standard Advanced Choice of Linux distributions x x Intel Cluster Ready x x Cluster Management GUI x x Cluster Management Shell x x Web-Based User Portal x x SOAP API x x Node Provisioning x x Node Identification x x Cluster Monitoring x x Cluster Automation x x User Management x x Role-based Access Control x x Parallel Shell x x Workload Manager Integration x x Cluster Security x x Compilers x x Debuggers & Profilers x x MPI Libraries x x Mathematical Libraries x x Environment Modules x x Cloud Bursting x x Hadoop Management & Monitoring x x NVIDIA CUDA & OpenCL - x GPU Management & Monitoring - x Xeon Phi Management & Monitoring - x ScaleMP Management & Monitoring - x Redundant Failover Head Nodes - x Cluster Health Checking - x Off-loadable Provisioning - x Multi-Cluster Management - x Suggested Number of Nodes 4–128 129–10,000+ Standard Support x x Premium Support Optional Optional Cluster health checks can be visualized in the Rackview. This screenshot shows that GPU unit 41 fails a health check called “AllFansRunning”.
    • 29
    • 31 Tandis que tous les systèmes HPC se trouvent confrontés à de grands challenges en matière de complexité, d’évolutivité et d’exigences des flux applicatifs et des ressources, les systèmes HPC des entreprises font face aux défis et attentes les plus exigeants. Les systèmes HPC d’entreprises doivent satisfaire aux exigences critiques et prioritaires des flux d’applications pour les entreprises commerciales et les organisations de recherche et académiques à orientation commerciale. Ils doi- vent trouver un équilibre entre des SLA et des priorités com- plexes. En effet, leurs flux HPC ont un impact direct sur leurs revenus, la livraison de leurs produits et leurs objectifs. 31 While all HPC systems face challenges in workload de- mand, resource complexity, and scale, enterprise HPC systems face more strin- gent challenges and expec- tations. Enterprise HPC systems must meet mission-critical and priority HPC workload demands for commercial businesses and business-ori- ented research and academ- ic organizations. They have complex SLAs and priorities to balance. Their HPC work- loads directly impact the revenue, product delivery, and organizational objec- tives of their organizations.
    • 32 Enterprise HPC Challenges Enterprise HPC systems must eliminate job delays and failures. They are also seeking to improve resource utilization and man- agement efficiency across multiple heterogeneous systems. To- maximize user productivity, they are required to make it easier to access and use HPC resources for users or even expand to other clusters or HPC cloud to better handle workload demand surges. Intelligent Workload Management As today’s leading HPC facilities move beyond petascale towards exascale systems that incorporate increasingly sophisticated and specialized technologies, equally sophisticated and intelligent management capabilities are essential. With a proven history of managing the most advanced, diverse, and data-intensive sys- tems in the Top500, Moab continues to be the preferred workload management solution for next generation HPC facilities. IntelligentHPC WorkloadManagement Moab HPC Suite – Basic Edition Moab is the “brain” of an HPC system, intelligently optimizing workload throughput while balancing service levels and priorities.
    • 33 Moab HPC Suite – Basic Edition Moab HPC Suite - Basic Edition is an intelligent workload man- agement solution. It automates the scheduling, managing, monitoring, and reporting of HPC workloads on massive scale, multi-technology installations. The patented Moab intelligence engine uses multi-dimensional policies to accelerate running workloads across the ideal combination of diverse resources. These policies balance high utilization and throughput goals with competing workload priorities and SLA requirements. The speed and accuracy of the automated scheduling decisions optimizes workload throughput and resource utilization. This gets more work accomplished in less time, and in the right prio- rity order. Moab HPC Suite - Basic Edition optimizes the value and satisfaction from HPC systems while reducing manage- ment cost and complexity. Moab HPC Suite – Enterprise Edition MoabHPCSuite-EnterpriseEditionprovidesenterprise-readyHPC workload management that accelerates productivity, automates workload uptime, and consistently meets SLAs and business prior- ities for HPC systems and HPC cloud. It uses the battle-tested and patented Moab intelligence engine to automatically balance the complex, mission-critical workload priorities of enterprise HPC systems. Enterprise customers benefit from a single integrated product that brings together key enterprise HPC capabilities, im- plementation services, and 24x7 support. This speeds the realiza- tion of benefits from the HPC system for the business including: ʎʎ Higher job throughput ʎʎ Massive scalability for faster response and system expansion ʎʎ Optimum utilization of 90-99% on a consistent basis ʎʎ Fast, simple job submission and management to increase
    • 34 productivity ʎʎ Reduced cluster management complexity and support costs across heterogeneous systems ʎʎ Reduced job failures and auto-recovery from failures ʎʎ SLAs consistently met for improved user satisfaction ʎʎ Reduce power usage and costs by 10-30% Productivity Acceleration Moab HPC Suite – Enterprise Edition gets more results deliv- ered faster from HPC resources at lower costs by accelerating overall system, user and administrator productivity. Moab pro- vides the scalability, 90-99 percent utilization, and simple job submission that is required to maximize the productivity of enterprise HPC systems. Enterprise use cases and capabilities include: ʎʎ Massive multi-point scalability to accelerate job response and throughput, including high throughput computing ʎʎ Workload-optimized allocation policies and provisioning to get more results out of existing heterogeneous resources and reduce costs, including topology-based allocation ʎʎ Unify workload management cross heterogeneous clus- ters to maximize resource availability and administration efficiency by managing them as one cluster ʎʎ Optimized, intelligent scheduling packs workloads and backfills around priority jobs and reservations while balanc- ing SLAs to efficiently use all available resources ʎʎ Optimized scheduling and management of accelerators, both Intel MIC and GPGPUs, for jobs to maximize their utili- zation and effectiveness ʎʎ Simplified job submission and management with advanced job arrays, self-service portal, and templates ʎʎ Administrator dashboards and reporting tools reduce man- agement complexity, time, and costs ʎʎ Workload-aware auto-power management reduces energy use and costs by 10-30 percent Intelligent HPC Workload Management Moab HPC Suite – Enterprise Edition “With Moab HPC Suite, we can meet very demand- ing customers’ requirements as regards unifi ed management of heterogeneous cluster environ- ments, grid management, and provide them with fl exible and powerful confi guration and reporting options. Our customers value that highly.” Thomas Gebert HPC Solution Architect
    • 35 Uptime Automation Job and resource failures in enterprise HPC systems lead to de- layed results, missed organizational opportunities, and missed objectives. Moab HPC Suite – Enterprise Edition intelligently au- tomates workload and resource uptime in the HPC system to en- sure that workload completes successfully and reliably, avoiding these failures. Enterprises can benefit from: ʎʎ Intelligent resource placement to prevent job failures with granular resource modeling to meet workload requirements and avoid at-risk resources ʎʎ Auto-response to failures and events with configurable ac- tions to pre-failure conditions, amber alerts, or other metrics and monitors ʎʎ Workload-aware future maintenance scheduling that helps maintain a stable HPC system without disrupting workload productivity ʎʎ Real-world expertise for fast time-to-value and system uptime with included implementation, training, and 24x7 support remote services Auto SLA Enforcement Moab HPC Suite – Enterprise Edition uses the powerful Moab intelligence engine to optimally schedule and dynamically adjust workload to consistently meet service level agree- ments (SLAs), guarantees, or business priorities. This auto- matically ensures that the right workloads are completed at the optimal times, taking into account the complex number of using departments, priorities and SLAs to be balanced. Moab provides: ʎʎ Usage accounting and budget enforcement that schedules resources and reports on usage in line with resource shar- ing agreements and precise budgets (includes usage limits,
    • 36 usage reports, auto budget management and dynamic fair- share policies) ʎʎ SLA and priority polices that make sure the highest priorty workloads are processed first (includes Quality of Service, hierarchical priority weighting) ʎʎ Continuous plus future scheduling that ensures priorities and guarantees are proactively met as conditions and work- load changes (i.e. future reservations, pre-emption, etc.) Grid- and Cloud-Ready Workload Management The benefits of a traditional HPC environment can be extended to more efficiently manage and meet workload demand with the ulti-cluster grid and HPC cloud management capabilities in Moab HPC Suite - Enterprise Edition. It allows you to: ʎʎ Showback or chargeback for pay-for-use so actual resource usage is tracked with flexible chargeback rates and report- ing by user, department, cost center, or cluster ʎʎ Manage and share workload across multiple remote clusters to meet growing workload demand or surges by adding on the Moab HPC Suite – Grid Option. Moab HPC Suite – Grid Option The Moab HPC Suite - Grid Option is a powerful grid-workload management solution that includes scheduling, advanced poli- cy management, and tools to control all the components of ad- vanced grids. Unlike other grid solutions, Moab HPC Suite - Grid Option connects disparate clusters into a logical whole, enabling grid administrators and grid policies to have sovereignty over all systems while preserving control at the individual cluster. Moab HPC Suite - Grid Option has powerful applications that al- low organizations to consolidate reporting; information gather- ing; and workload, resource, and data management. Moab HPC Suite - Grid Option delivers these services in a near-transparent way: users are unaware they are using grid resources—they know only that they are getting work done faster and more eas- ily than ever before. Intelligent HPC Workload Management Moab HPC Suite – Grid Option
    • 37 Moab manages many of the largest clusters and grids in the world. Moab technologies are used broadly across Fortune 500 companies and manage more than a third of the compute cores in the top 100 systems of the Top500 supercomputers. Adaptive Computing is a globally trusted ISV (independent software ven- dor), and the full scalability and functionality Moab HPC Suite with the Grid Option offers in a single integrated solution has traditionally made it a significantly more cost-effective option than other tools on the market today. Components The Moab HPC Suite - Grid Option is available as an add-on to Moab HPC Suite - Basic Edition and -Enterprise Edition. It extends the capabilities and functionality of the Moab HPC Suite compo- nents to manage grid environments including: ʎʎ Moab Workload Manager for Grids – a policy-based work- load management and scheduling multi-dimensional deci- sion engine ʎʎ Moab Cluster Manager for Grids – a powerful and unified graphical administration tool for monitoring, managing and reporting tool across multiple clusters ʎʎ Moab ViewpointTM for Grids – a Web-based self-service end-user job-submission and management portal and ad- ministrator dashboard Benefits ʎʎ Unified management across heterogeneous clusters provides ability to move quickly from cluster to optimized grid ʎʎ Policy-driven and predictive scheduling ensures that jobs start and run in the fastest time possible by selecting optimal resources ʎʎ Flexible policy and decision engine adjusts workload pro- cessing at both grid and cluster level ʎʎ Grid-wide interface and reporting tools provide view of grid resources, status and usage charts, and trends over time for capacity planning, diagnostics, and accounting ʎʎ Advanced administrative control allows various business units to access and view grid resources, regardless of physical or organizational boundaries, or alternatively restricts access to resources by specific departments or entities ʎʎ Scalable architecture to support peta-scale, highthroughput computing and beyond Grid Control with Automated Tasks, Policies, and Re- porting ʎʎ Guarantee that the most-critical work runs first with flexible global policies that respect local cluster policies but continue to support grid service-level agreements ʎʎ Ensure availability of key resources at specific times with ad- vanced reservations ʎʎ Tune policies prior to rollout with cluster- and grid-level simu- lations ʎʎ Use a global view of all grid operations for self-diagnosis, plan- ning, reporting, and accounting across all resources, jobs, and clusters Moab HPC Suite -Grid Option can be flexibly configured for centralized, cen- tralized and local, or peer-to-peer grid policies, decisions and rules. It is able to manage multiple resource managers and multiple Moab instances.
    • 38 HarnessingthePowerofIntelXeonPhiandGPGPUs The mathematical acceleration delivered by many-core General Purpose Graphics Processing Units (GPGPUs) offers significant performance advantages for many classes of numerically in- tensive applications. Parallel computing tools such as NVIDIA’s CUDA and the OpenCL framework have made harnessing the power of these technologies much more accessible to develop- ers, resulting in the increasing deployment of hybrid GPGPU- based systems and introducing significant challenges for ad- ministrators and workload management systems. The new Intel Xeon Phi coprocessors offer breakthrough perfor- mance for highly parallel applications with the benefit of lever- aging existing code and flexibility in programming models for faster application development. Moving an application to Intel Xeon Phi coprocessors has the promising benefit of requiring substantially less effort and lower power usage. Such a small investment can reap big performance gains for both data-inten- sive and numerical or scientific computing applications that can make the most of the Intel Many Integrated Cores (MIC) technol- ogy. So it is no surprise that organizations are quickly embracing these new coprocessors in their HPC systems. The main goal is to create breakthrough discoveries, products and research that improve our world. Harnessing and optimiz- ing the power of these new accelerator technologies is key to doing this as quickly as possible. Moab HPC Suite helps organi- zations easily integrate accelerator and coprocessor technolo- gies into their HPC systems, optimizing their utilization for the right workloads and problems they are trying to solve. Intelligent HPC Workload Management OptimizingAcceleratorswithMoabHPCSuite
    • 39 Moab HPC Suite automatically schedules hybrid systems incor- porating Intel Xeon Phi coprocessors and GPGPU accelerators, optimizing their utilization as just another resource type with policies. Organizations can choose the accelerators that best meet their different workload needs. Managing Hybrid Accelerator Systems Hybrid accelerator systems add a new dimension of manage- ment complexity when allocating workloads to available re- sources. In addition to the traditional needs of aligning work- load placement with software stack dependencies, CPU type, memory, and interconnect requirements, intelligent workload management systems now need to consider: ʎʎ Workload’s ability to exploit Xeon Phi or GPGPU technology ʎʎ Additional software dependenciesand reduce costs, including topology-based allocation ʎʎ Current health and usage status of available Xeon Phis or GPGPUs ʎʎ Resource configuration for type and number of Xeon Phis or GPGPUs Moab HPC Suite auto detects which types of accelerators are where in the system to reduce the management effort and costs as these processors are introduced and maintained together in an HPC system. This gives customers maximum choice and per- formance in selecting the accelerators that work best for each of their workloads, whether Xeon Phi, GPGPU or a hybrid mix of the two. Moab HPC Suite optimizes accelerator utilization with policies that ensure the right ones get used for the right user and group jobs at the right time.
    • 40 Optimizing Intel Xeon Phi and GPGPUs Utilization System and application diversity is one of the first issues workload management software must address in optimizing accelerator utilization. The ability of current codes to use GP- GPUs and Xeon Phis effectively, ongoing cluster expansion, and costs usually mean only a portion of a system will be equipped with one or a mix of both types of accelerators. Moab HPC Suite has a twofold role in reducing the complexity of their administration and optimizing their utilization. First, it must be able to automatically detect new GPGPUs or Intel Xeon Phi coprocessors in the environment and their availabil- ity without the cumbersome burden of manual administra- tor configuration. Second, and most importantly, it must ac- curately match Xeon Phi-bound or GPGPUenabled workloads with the appropriately equipped resources in addition to managing contention for those limited resources according to organizational policy. Moab’s powerful allocation and pri- oritization policies can ensure the right jobs from users and groups get to the optimal accelerator resources at the right time. This keeps the accelerators at peak utilization for the right priority jobs. Moab’s policies are a powerful ally to administrators in auto determining the optimal Xeon Phi coprocessor or GPGPU to use and ones to avoid when scheduling jobs. These alloca- tion policies can be based on any of the GPGPU or Xeon Phi metrics such as memory (to ensure job needs are met), tem- perature (to avoid hardware errors or failures), utilization, or other metrics. IntelligentHPC WorkloadManagement OptimizingAcceleratorswithMoabHPCSuite
    • 41 Use the following metrics in policies to optimize allocation of Intel Xeon Phi coprocessors: ʎʎ Number of Cores ʎʎ Number of Hardware Threads ʎʎ Physical Memory ʎʎ Free Physical Memory ʎʎ Swap Memory ʎʎ Free Swap Memory ʎʎ Max Frequency (in MHz) Use the following metrics in policies to optimize allocation of GPGPUs and for management diagnostics: ʎʎ Error counts; single/double-bit, commands to reset counts ʎʎ Temperature ʎʎ Fan speed ʎʎ Memory; total, used, utilization ʎʎ Utilization percent ʎʎ Metrics time stamp Improve GPGPU Job Speed and Success The GPGPU drivers supplied by vendors today allow multiple jobs to share a GPGPU without corrupting results, but sharing GPGPUs and other key GPGPU factors can significantly impact the speed and success of GPGPU jobs and the final level of ser- vice delivered to the system users and the organization. Consider the example of a user estimating the time for a GPGPU job based on exclusive access to a GPGPU, but the workload man- ager allowing the GPGPU to be shared when the job is scheduled. The job will likely exceed the estimated time and be terminated by the workload manager unsuccessfully, leaving a very unsatis- fied user and the organization, group or project they represent. Moab HPC Suite provides the ability to schedule GPGPU jobs to run exclusively and finish in the shortest time possible for cer- tain types, or classes of jobs or users to ensure job speed and success.
    • 42 Cluster Sovereignty and Trusted Sharing ʎʎ Guarantee that shared resources are allocated fairly with global policies that fully respect local cluster configuration and needs ʎʎ Establish trust between resource owners through graphical usage controls, reports, and accounting across all shared resources ʎʎ Maintain cluster sovereignty with granular settings to con- trol where jobs can originate and be processed n Establish resource ownership and enforce appropriate access levels with prioritization, preemption, and access guarantees Increase User Collaboration and Productivity ʎʎ Reduce end-user training and job management time with easy-to-use graphical interfaces ʎʎ Enable end users to easily submit and manage their jobs through an optional web portal, minimizing the costs of ca- tering to a growing base of needy users ʎʎ Collaborate more effectively with multicluster co-allocation, allowing key resource reservations for high-priority projects ʎʎ Leverage saved job templates, allowing users to submit mul- Intelligent HPC Workload Management Moab HPC Suite – Application Portal Edition The Visual Cluster view in Moab Cluster Manager for Grids makes cluster and grid management easy and efficient.
    • 43 tiple jobs quickly and with minimal changes ʎʎ Speed job processing with enhanced grid placement options for job arrays; optimal or single cluster placement Process More Work in Less Time to Maximize ROI ʎʎ Achieve higher, more consistent resource utilization with in- telligent scheduling, matching job requests to the bestsuited resources, including GPGPUs ʎʎ Use optimized data staging to ensure that remote data trans- fers are synchronized with resource availability to minimize poor utilization ʎʎ Allow local cluster-level optimizations of most grid workload Unify Management Across Independent Clusters ʎʎ Unify management across existing internal, external, and partner clusters – even if they have different resource man- agers, databases, operating systems, and hardware ʎʎ Out-of-the-box support for both local area and wide area grids ʎʎ Manage secure access to resources with simple credential mapping or interface with popular security tool sets ʎʎ Leverage existing data-migration technologies, such as SCP or GridFTP Moab HPC Suite – Application Portal Edition Remove the Barriers to Harnessing HPC Organizations of all sizes and types are looking to harness the potential of high performance computing (HPC) to enable and accelerate more competitive product design and research. To do this, you must find a way to extend the power of HPC resourc- es to designers, researchers and engineers not familiar with or trained on using the specialized HPC technologies. Simplifying the user access to and interaction with applications running on more efficient HPC clusters removes the barriers to discovery and innovation for all types of departments and organizations. Moab® HPC Suite – Application Portal Edition provides easy single-point access to common technical applications, data, HPC resources, and job submissions to accelerate research analysis and collaboration process for end users while reducing comput- ing costs. HPC Ease of Use Drives Productivity Moab HPC Suite – Application Portal Edition improves ease of use and productivity for end users using HPC resources with sin- gle-point access to their common technical applications, data, resources, and job submissions across the design and research process. It simplifies the specification and running of jobs on HPC resources to a few simple clicks on an application specific template with key capabilities including: ʎʎ Application-centric job submission templates for common manufacturing, energy, life-sciences, education and other industry technical applications ʎʎ Support for batch and interactive applications, such as simulations or analysis sessions, to accelerate the full proj- ect or design cycle ʎʎ No special HPC training required to enable more users to start accessing HPC resources with intuitive portal ʎʎ Distributed data management avoids unnecessary remote file transfers for users, storing data optimally on the network for easy access and protection, and optimizing file transfer to/ from user desktops if needed Accelerate Collaboration While Maintaining Control More and more projects are done across teams and with partner and industry organizations. Moab HPC Suite – Application Portal Edition enables you to accelerate this collaboration between indi- viduals and teams while maintaining control as additional users, inside and outside the organization, can easily and securely ac- cessprojectapplications,HPCresourcesanddatatospeedproject cycles. Key capabilities and benefits you can leverage are: ʎʎ Encrypted and controllable access, by class of user, ser- vices, application or resource, for remote and partner users improves collaboration while protecting infrastructure and intellectual property
    • 44 ʎʎ Enable multi-site, globally available HPC services available anytime, anywhere, accessible through many different types of devices via a standard web browser ʎʎ Reduced client requirements for projectsmeans new proj- ect members can quickly contribute without being limited by their desktop capabilities Reduce Costs with Optimized Utilization and Sharing Moab HPC Suite – Application Portal Edition reduces the costs of technical computing by optimizing resource utilization and the sharing of resources including HPC compute nodes, application licenses, and network storage. The patented Moab intelligence engine and its powerful policies deliver value in key areas of uti- lization and sharing: ʎʎ Maximize HPC resource utilization with intelligent, opti- mized scheduling policies that pack and enable more work to be done on HPC resources to meet growing and changing demand ʎʎ Optimize application license utilization and access by shar- ing application licenses in a pool, re-using and optimizing us- age with allocation policies that integrate with license man- agers for lower costs and better service levels to a broader set of users ʎʎ Usage budgeting and priorities enforcement with usage accounting and priority policies that ensure resources are shared in-line with service level agreements and project pri- orities ʎʎ Leverage and fully utilize centralized storage capacity in- stead of duplicative, expensive, and underutilized individual workstation storage Key Intelligent Workload Management Capabilities: ʎʎ Massive multi-point scalability ʎʎ Workload-optimized allocation policies and provisioning ʎʎ Unify workload management cross heterogeneous clusters ʎʎ Optimized, intelligent scheduling Intelligent HPC Workload Management MoabHPCSuite–RemoteVisualizationEdition
    • 45 ʎʎ Optimized scheduling and management of accelerators (both Intel MIC and GPGPUs) ʎʎ Administrator dashboards and reporting tools ʎʎ Workload-aware auto power management ʎʎ Intelligent resource placement to prevent job failures ʎʎ Auto-response to failures and events ʎʎ Workload-aware future maintenance scheduling ʎʎ Usage accounting and budget enforcement ʎʎ SLA and priority polices ʎʎ Continuous plus future scheduling MoabHPCSuite–RemoteVisualizationEdition Removing Inefficiencies in Current 3D Visualization Methods Using 3D and 2D visualization, valuable data is interpreted and new innovations are discovered. There are numerous inefficien- cies in how current technical computing methods support users and organizations in achieving these discoveries in a cost effec- tive way. High-cost workstations and their software are often not fully utilized by the limited user(s) that have access to them. They are also costly to manage and they don’t keep pace long with workload technology requirements. In addition, the work- station model also requires slow transfers of the large data sets between them. This is costly and inefficient in network band- width, user productivity, and team collaboration while posing in- creased data and security risks. There is a solution that removes these inefficiencies using new technical cloud computing mod- els. Moab HPC Suite – Remote Visualization Edition significantly reduces the costs of visualization technical computing while improving the productivity, collaboration and security of the de- sign and research process. ReducetheCostsofVisualizationTechnicalComputing Moab HPC Suite – Remote Visualization Edition significantly reduces the hardware, network and management costs of vi- sualization technical computing. It enables you to create easy, centralized access to shared visualization compute, applica- tion and data resources. These compute, application, and data resources reside in a technical compute cloud in your data cen- ter instead of in expensive and underutilized individual work- stations. ʎʎ Reduce hardware costs by consolidating expensive individual technical workstations into centralized visualization servers for higher utilization by multiple users; reduce additive or up- grade technical workstation or specialty hardware purchases, such as GPUs, for individual users. ʎʎ Reduce management costs by moving from remote user work- stations that are difficult and expensive to maintain, upgrade, and back-up to centrally managed visualization servers that require less admin overhead ʎʎ Decreasenetworkaccesscostsandcongestionassignificantly lower loads of just compressed, visualized pixels are moving to users, not full data sets ʎʎ Reduce energy usageas centralized visualization servers con- sume less energy for the user demand met ʎʎ Reduce data storage costs by consolidating data into common storage node(s) instead of under-utilized individual storage The Application Portal Edition easily extends the power of HPC to your users and projects to drive innovation and competentive advantage.
    • 46 Consolidation and Grid Many sites have multiple clusters as a result of having multiple independent groups or locations, each with demands for HPC, and frequent additions of newer machines. Each new cluster increases the overall administrative burden and overhead. Addi- tionally, many of these systems can sit idle while others are over- loaded. Because of this systems-management challenge, sites turn to grids to maximize the efficiency of their clusters. Grids can be enabled either independently or in conjunction with one another in three areas: ʎʎ Reporting Grids Managers want to have global reporting across all HPC resources so they can see how users and proj- ects are really utilizing hardware and so they can effectively plan capacity. Unfortunately, manually consolidating all of this information in an intelligible manner for more than just a couple clusters is a management nightmare. To solve that problem, sites will create a reporting grid, or share informa- tion across their clusters for reporting and capacity-planning purposes. ʎʎ Management Grids Managing multiple clusters indepen- dently can be especially difficult when processes change, because policies must be manually reconfigured across all clusters. To ease that difficulty, sites often set up manage- ment grids that impose a synchronized management layer across all clusters while still allowing each cluster some level of autonomy. ʎʎ Workload-Sharing Grids Sites with multiple clusters often have the problem of some clusters sitting idle while other clusters have large backlogs. Such inequality in cluster utili- zation wastes expensive resources, and training users to per- form different workload-submission routines across various clusters can be difficult and expensive as well. To avoid these problems, sites often set up workload-sharing grids. These grids can be as simple as centralizing user submission or as Intelligent HPC Workload Management Using Moab With Grid Environments
    • 47 complex as having each cluster maintain its own user sub- mission routine with an underlying grid-management tool that migrates jobs between clusters. Inhibitors to Grid Environments Three common inhibitors keep sites from enabling grid environ- ments: ʎʎ Politics Because grids combine resources across users, groups, and projects that were previously independent, grid implementation can be a political nightmare. To create a grid in the real world, sites need a tool that allows clusters to retain some level of sovereignty while participating in the larger grid. ʎʎ Multiple Resource Managers Most sites have a variety of resource managers used by various groups within the orga- nization, and each group typically has a large investment in scripts that are specific to one resource manager and that cannot be changed. To implement grid computing effective- ly, sites need a robust tool that has flexibility in integrating with multiple resource managers. ʎʎ Credentials Many sites have different log-in credentials for each cluster, and those credentials are generally indepen- dent of one another. For example, one user might be Joe.P on one cluster and J_Peterson on another. To enable grid envi- ronments, sites must create a combined user space that can recognize and combine these different credentials. Using Moab in a Grid Sites can use Moab to set up any combination of reporting, management, and workload-sharing grids. Moab is a grid metas- cheduler that allows sites to set up grids that work effectively in the real world. Moab’s feature-rich functionality overcomes the inhibitors of politics, multiple resource managers, and varying credentials by providing: ʎʎ Grid Sovereignty Moab has multiple features that break down political barriers by letting sites choose how each clus- ter shares in the grid. Sites can control what information is shared between clusters and can specify which workload is passed between clusters. In fact, sites can even choose to let each cluster be completely sovereign in making decisions about grid participation for itself. ʎʎ Support for Multiple Resource Managers Moab meta- schedules across all common resource managers. It fully integrates with TORQUE and SLURM, the two most-common open-source resource managers, and also has limited in- tegration with commercial tools such as PBS Pro and SGE. Moab’s integration includes the ability to recognize when a user has a script that requires one of these tools, and Moab can intelligently ensure that the script is sent to the correct machine. Moab even has the ability to translate common scripts across multiple resource managers. ʎʎ Credential Mapping Moab can map credentials across clus- ters to ensure that users and projects are tracked appropri- ately and to provide consolidated reporting.
    • 48 Improve Collaboration, Productivity and Security With Moab HPC Suite – Remote Visualization Edition, you can im- prove the productivity, collaboration and security of the design and research process by only transferring pixels instead of data to users to do their simulations and analysis. This enables a wid- er range of users to collaborate and be more productive at any time, on the same data, from anywhere without any data trans- fer time lags or security issues. Users also have improved imme- diate access to specialty applications and resources, like GPUs, they might need for a project so they are no longer limited by personal workstation constraints or to a single working location. ʎʎ Improve workforce collaboration by enabling multiple users to access and collaborate on the same interactive ap- plication data at the same time from almost any location or device, with little or no training needed ʎʎ Eliminate data transfer time lags for users by keeping data close to the visualization applications and compute resourc- es instead of transferring back and forth to remote worksta- tions; only smaller compressed pixels are transferred ʎʎ Improve security with centralized data storage so data no longer gets transferred to where it shouldn’t, gets lost, or gets inappropriately accessed on remote workstations, only pixels get moved MaximizeUtilizationandSharedResourceGuarantees Moab HPC Suite – Remote Visualization Edition maximizes your resource utilization and scalability while guaranteeing shared resource access to users and groups with optimized workload management policies. Moab’s patented policy intelligence en- gine optimizes the scheduling of the visualization sessions across the shared resources to maximize standard and specialty resource utilization, helping them scale to support larger vol- umes of visualization workloads. These intelligent policies also guarantee shared resource access to users to ensure a high ser- vice level as they transition from individual workstations. Intelligent HPC Workload Management MoabHPCSuite–RemoteVisualizationEdition
    • 49 ʎʎ Guarantee shared visualization resource access for users with priority policies and usage budgets that ensure they receive the appropriate service level. Policies include ses- sion priorities and reservations, number of users per node, fairshare usage policies, and usage accounting and budgets across multiple groups or users. ʎʎ Maximize resource utilization and scalability by packing visualization workload optimally on shared servers using Moab allocation and scheduling policies. These policies can include optimal resource allocation by visualization session characteristics (like CPU, memory, etc.), workload packing, number of users per node, and GPU policies, etc. ʎʎ Improve application license utilization to reduce software costs and improve access with a common pool of 3D visual- ization applications shared across multiple users, with usage optimized by intelligent Moab allocation policies, integrated with license managers. ʎʎ Optimized scheduling and management of GPUs and other accelerators to maximize their utilization and effectiveness ʎʎ Enable multiple Windows or Linux user sessions on a single machine managed by Moab to maximize resource and power utilization. ʎʎ Enable dynamic OS re-provisioning on resources, managed by Moab, to optimally meet changing visualization applica- tion workload demand and maximize resource utilization and availability to users. The Remote Visualization Edition makes visualization more cost effective and productive with an innovative technical compute cloud.
    • transtec HPC solutions are designed for maximum flexibility, and ease of management. We not only offer our customers the most powerful and flexible cluster management solution out there, but also provide them with customized setup and site- specific configuration. Whether a customer needs a dynamical Linux-Windows dual-boot solution, unifi ed management of dif- ferent clusters at different sites, or the fi ne-tuning of the Moab scheduler for implementing fi ne-grained policy confi guration – transtec not only gives you the framework at hand, but also helps you adapt the system according to your special needs. Needless to say, when customers are in need of special train- ings, transtec will be there to provide customers, administra- tors, or users with specially adjusted Educational Services. Having a many years’ experience in High Performance Comput- ing enabled us to develop effi cient concepts for installing and deploying HPC clusters. For this, we leverage well-proven 3rd party tools, which we assemble to a total solution and adapt and confi gure according to the customer’s requirements. We manage everything from the complete installation and confi guration of the operating system, necessary middleware components like job and cluster management systems up to the customer’s applications. 50 Intelligent HPC Workload Management MoabHPCSuite–RemoteVisualizationEdition
    • Moab HPC Suite Use Case Basic Enterprice App. Portal Remote Visualization ProductivityAcceleration Massive multifaceted scalability x x x x Workload optimized resource allocation x x x x Optimized, intelligent scheduling of work- load and resources x x x x Optimized accelerator scheduling & manage- ment (Intel MIC & GPGPU) x x x x Simplified user job submission & mgmt. - Application Portal x x x x x Administrator dashboard and management reporting x x x x Unified workload management across het- erogeneous cluster (s) x Single cluster only x x x Workload optimized node provisioning x x x Visualization Workload Optimization x Workload-aware auto-power management x x x Uptime Automation Intelligent resource placement for failure prevention x x x x Workload-aware maintenance scheduling x x x x Basic deployment, training and premium (24x7) support service Standard support only x x x Auto response to failures and events x Limited no re-boot/provision x x x Auto SLA enforcement Resource sharing and usage policies x x x x SLA and oriority policies x x x x Continuous and future reservations scheduling x x x x Usage accounting and budget enforcement x x x Grid- and Cloud-Ready Manage and share workload across multiple clusters in wich area grid x w/grid Option x w/grid Option x w/grid Option x w/grid Option User self-service: job request & management portal x x x x Pay-for-use showback and chargeback x x x Workload optimized node provisioning & re-purposing x x x Multiple clusters in single domain 51
    • InfiniBand high-speed interconntects NICEEnginFrame– ATechnicalComputing Portal
    • 53 InfiniBand (IB) is an efficient I/O technology that provides high- speed data transfers and ultra-low latencies for computing and storage over a highly reliable and scalable single fabric. The InfiniBand industry standard ecosystem creates cost effective hardware and software solutions that easily scale from genera- tion to generation. InfiniBand is a high-bandwidth, low-latency network interconnect solution that has grown tremendous market share in the High Performance Computing (HPC) cluster community. InfiniBand was designed to take the place of today’s data center networking tech- nology. In the late 1990s, a group of next generation I/O architects formed an open, community driven network technology to provide scalability and stability based on successes from other network designs. Today, InfiniBand is a popular and widely used I/O fabric among customers within the Top500 supercomputers: major Uni- versities and Labs; Life Sciences; Biomedical; Oil and Gas (Seismic, Reservoir, Modeling Applications); Computer Aided Design and En- gineering; Enterprise Oracle; and Financial Applications. HPC systems and enterprise grids deliver unprecedented time-to-market and perfor- mance advantages to many research and corporate cus- tomers, struggling every day with compute and data intensive processes. This of- ten generates or transforms massive amounts of jobs and data, that needs to be handled and archived effi- ciently to deliver timely in- formation to users distrib- uted in multiple locations, with different security con- cerns. Poor usability of such com- plex systems often nega- tively impacts users’ pro- ductivity, and ad-hoc data management often increas- es information entropy and dissipates knowledge and intellectual property.
    • 54 NICEEnginFrame A Technical Portal for Remote Visualization Solving distributed computing issues for our customers, it is easy to understand that a modern, user-friendly web front-end to HPC and grids can drastically improve engineering productiv- ity, if properly designed to address the specific challenges of the Technical Computing market. Nice EnginFrame overcomes many common issues in the areas of usability, data management, security and integration, open- ing the way to a broader, more effective use of the Technical Computing resources. The key components of our solutions are: ʎʎ a flexible and modular Java-based kernel, with clear separati- on between customizations and core services ʎʎ powerful data management features, reflecting the typical needs of engineering applications ʎʎ comprehensive security options and a fine grained authoriza- tion system ʎʎ scheduler abstraction layer to adapt to different workload and resource managers ʎʎ responsive and competent support services
    • 55 Enduserscantypicallyenjoythefollowingimprovements: ʎʎ user-friendly, intuitive access to computing resources, using a standard web browser ʎʎ application-centric job submission ʎʎ organized access to job information and data ʎʎ increased mobility and reduced client requirements On the other side, the Technical Computing Portal delivers sig- nificant added-value for system administrators and IT: ʎʎ reduced training needs to enable users to access the resour- ces ʎʎ centralized configuration and immediate deployment of ser- vices ʎʎ comprehensive authorization to access services and informa- tion ʎʎ reduced support calls and submission errors Coupled with our Remote Visualization solutions, our custom- ers quickly deploy end-to-end engineering processes on their Intranet, Extranet or Internet. EnginFrame Highlights ʎʎ Universal and flexible access to your Grid infrastructure En- ginFrame provides easy accessover intranet, extranet or the Internet using standard and secure Internet languages and protocols. ʎʎ Interface to the Grid in your organization EnginFrame offers pluggable server-side XML services to enable easy integrati- on of the leading workload schedulers in the market. Plug-in modules for all major grid solutions including the Platform Computing LSF and OCS, Microsoft Windows HPC Server 2008, Sun GridEngine, UnivaUD UniCluster, IBM LoadLeve- ler, Altair PBS/Pro and OpenPBS, Torque, Condor, EDG/gLite Globus-based toolkits, and provides easy interfaces to the existing computing infrastructure. ʎʎ Security and Access Control You can give encrypted and con- trollable access to remote users and improve collaboration with partners while protecting your infrastructure and Intel- lectualProperty(IP). Thisenablesyoutospeedupdesigncycles while working with related departments or external partners and eases communication through a secure infrastructure. Distributed Data Management The comprehensive remote file management built into EnginFrame avoids unnecessary file transfers, and enables server-side treatment of data, as well as file transfer from/to the user’s desktop. ʎʎ Interactive application support Through the integration with leading GUI virtualization solutions (including Desktop Cloud Visualization (DCV), VNC, VirtualGL and NX), EnginFrame enables you to address all stages of the design process, both batch and interactive. ʎʎ SOA-enable your applications and resources EnginFrame can automatically publish your computing applications via standard WebServices (tested both for Java and .NET intero- perability), enabling immediate integration with other enter- prise applications. Business Benefits Category Business Benefits Web and Cloud Portal ʎʎ Simplify user access; minimize the training requirements for HPC users; broaden the user community ʎʎ Enable access from intranet, internet or extranet
    • 56 Aerospace, Automotive and Manufacturing The complex factors involved in CAE range from compute-in- tensive data analysis to worldwide collaboration between de- signers, engineers, OEM’s and suppliers. To accommodate these factors, Cloud (both internal and external) and Grid Computing solutions are increasingly seen as a logical step to optimize IT infrastructure usage for CAE. It is no surprise then, that automo- tive and aerospace manufacturers were the early adopters of internal Cloud and Grid portals. Manufacturers can now develop more “virtual products” and simulate all types of designs, fluid flows, and crash simula- tions. Such virtualized products and more streamlined collab- oration environments are revolutionizing the manufacturing process. With NICE EnginFrame in their CAE environment engineers can take the process even further by connecting design and simula- tion groups in “collaborative environments” to get even greater benefits from “virtual products”. Thanks to EnginFrame, CAE en- gineers can have a simple intuitive collaborative environment that can care of issues related to: ʎʎ Access & Security - Where an organization must give access to external and internal entities such as designers, engineers and suppliers. ʎʎ Distributed collaboration - Simple and secure connection of design and simulation groups distributed worldwide. ʎʎ Time spent on IT tasks - By eliminating time and resources spent using cryptic job submission commands or acquiring knowledge of underlying compute infrastructures, engi- neers can spend more time concentrating on their core de- sign tasks. NICEEnginFrame Application Highlights
    • 57 EnginFrame’s web based interface can be used to access the compute resources required for CAE processes. This means ac- cess to job submission & monitoring tasks, input & output data associated with industry standard CAE/CFD applications for Flu- id-Dynamics, Structural Analysis, Electro Design and Design Col- laboration, (like Abaqus, Ansys, Fluent, MSC Nastran, PAMCrash, LS-Dyna, Radioss) without cryptic job submission commands or knowledge of underlying compute infrastructures. EnginFrame has a long history of usage in some of the most prestigious manufacturing organizations worldwide including Aerospace companies like AIRBUS, Alenia Space, CIRA, Galileo Avi- onica, Hamilton Sunstrand, Magellan Aerospace, MTU and Auto- motive companies like Audi, ARRK, Brawn GP, Bridgestone, Bosch, Delphi, Elasis, Ferrari, FIAT, GDX Automotive, Jaguar-LandRover, Lear, Magneti Marelli, McLaren, P+Z, RedBull, Swagelok, Suzuki, Toyota, TRW. Life Sciences and Healthcare NICE solutions are deployed in the Life Sciences sector at com- panies like BioLab, Partners HealthCare, Pharsight and the M.D.Anderson Cancer Center; also in leading research projects like DEISA or LitBio in order to allow easy and transparent use of com- puting resources without any insight of the HPC infrastructure. The Life Science and Healthcare sectors have some very strict re- quirement when choosing its IT solution like EnginFrame, for in- stance ʎʎ Security - To meet the strict security & privacy requirements of the biomedical and pharmaceutical industry any solution needs to take account of multiple layers of security and au- thentication. ʎʎ Industry specific software - ranging from the simplest cus- tom tool to the more general purpose free and open middle- wares. EnginFrame’s modular architecture allows for different Grid mid- dlewares and software including leading Life Science applications like Schroedinger Glide, EPFL RAxML, BLAST family, Taverna, and R) to be exploited, R Users can compose elementary services into complex applications and “virtual experiments”, run, monitor and build workflows via a standard Web browser. EnginFrame also has highly tuneable resource sharing and fine-grained access control where multiple authentication systems (like Active Directory, Krb5 or LDAP) can be exploited simultaneously.
    • 58 Web and Cloud Portal ʎʎ “One Stop Shop” - the portal as the starting point for knowledge and resources ʎʎ Enable collaboration and sharing through the portal as a virtual meeting point (employees, part- ners, and affiliates) ʎʎ Single Sign-on ʎʎ Integrations for many technical computing ISV and open-source applications and infrastructure components Business Continuity ʎʎ Virtualizes servers and server locations, data and storage loca- tions, applications and licenses ʎʎ Move to thin desktops for graphi- cal workstation applications by moving graphics processing into the datacenter (also can increase utilization and re-use of licenses) ʎʎ Enable multi-site and global loca- tions and services ʎʎ Data maintained on network sto- rage - not on laptop/desktop ʎʎ Cloud and ASP – create, or take advantage of, Cloud and ASP (Application Service Provider) services to extend your business, grow new revenue, manage costs. NICEEnginFrame Desktop Cloud Virtualization “The amount of data resulting from e.g. simulations in CAE or other engineering environments can be in the Gigabyte range. It is obvious that remote post- processingisoneofthemosturgenttopicstobetack- led. NICE Engine Frame provides exactly that, and our customers are impressed that such great technology enhances their workflow so significantly.” Robin Kienecker HPC Sales Specialist
    • 59 Compliance ʎʎ Ensure secure and auditable access and use of resources (ap- plications, data, infrastructure, licenses) ʎʎ Enables IT to provide better upti- me and service-levels to users ʎʎ Allow collaboration with partners while protecting Intellectual Property and resources ʎʎ Restrict access by class of user, service, application, and resource Web Services ʎʎ Wrapper legacy applications (command line, local API/SDK) with a web services SOAP/.NET/ Java interface to access in your SOA environment ʎʎ Workflow your human and auto- mated business processes and supply chain ʎʎ Integrate with corporate policy for Enterprise Portal or Business Process Management (works with SharePoint®, WebLogic®, WebS- phere®, etc). Desktop Cloud Virtualization Nice Desktop Cloud Visualization (DCV) is an advanced technol- ogy that enables Technical Computing users to remote access 2D/3D interactive applications over a standard network. Engineers and scientists are immediately empowered by taking full advantage of high-end graphics cards, fast I/O performance and large memory nodes hosted in “Public or Private 3D Cloud”, rather then waiting for the next upgrade of the workstations. TheDCVprotocoladaptstoheterogeneousnetworkinginfrastruc- tures like LAN, WAN and VPN, to deal with bandwidth and latency constraints. All applications run natively on the remote machines, that could be virtualized and share the same physical GPU. In a typical visualization scenario, a software application sends a stream of graphics commands to a graphics adapter through an input/output (I/O) interface. The graphics adapter renders the data into pixels and outputs them to the local display as a video signal. When using Nice DCV, the scene geometry and graphics state are rendered on a central server, and pixels are sent to one or more remote displays. This approach requires the server to be equipped with one or more GPUs, which are used for the OpenGL rendering, while the client software can run on “thin” devices. Nice DCV architecture consist of: ʎʎ DCVServer,equippedwithoneormoreGPUs,usedforOpenGL rendering ʎʎ one or more DCV EndStations, running on “thin clients”, only used for visualization ʎʎ etherogeneous networking infrastructures (like LAN, WAN and VPN), optimized balancing quality vs frame rate Nice DCV Highlights ʎʎ enables high performance remote access to interactive 2D/3D software applications on low bandwidth/high latency ʎʎ supports multiple etherogeneous OS (Windows, Linux) ʎʎ enables GPU sharing ʎʎ supports 3D acceleration for OpenGL applications running on Virtual Machines
    • 60 Remote Visualization It’s human nature to want to ‘see’ the results from simula- tions, tests, and analyses. Up until recently, this has meant ‘fat’ workstations on many user desktops. This approach pro- vides CPU power when the user wants it – but as dataset size increases, there can be delays in downloading the results. Also, sharing the results with colleagues means gathering around the workstation - not always possible in this global- ized, collaborative workplace. Increasing dataset complexity (millions of polygons, interact- ing components, MRI/PET overlays) means that as time comes to upgrade and replace the workstations, the next generation of hardware needs more memory, more graphics processing, more disk, and more cpu cores. This makes the workstation expensive, in need of cooling, and noisy. Innovation in the field of remote 3D processing now allows companies to address these issues moving applications away from the Desktop into the data center. Instead of pushing data to the application, the application can be moved near the data. Instead of mass workstation upgrades, Remote Visualization allows incremental provisioning, on-demand allocation, bet- ter management and efficient distribution of interactive ses- sions and licenses. Racked workstations or blades typically have lower maintenance, cooling, replacement costs, and they can extend workstation (or laptop) life as “thin clients”. NICEEnginFrame Remote Visualization
    • 61 The solution Leveraging their expertise in distributed computing and Web- based application portals, Nice offers an integrated solution to access, load balance and manage applications and desktop ses- sions running within a visualization farm. The farm can include both Linux and Windows resources, running on heterogeneous hardware. The core of the solution is the EnginFrame Visualization plug-in, that delivers Web-based services to access and manage applica- tions and desktops published in the Farm. This solution has been integrated with:  NICE Desktop Cloud Visualization (DCV)  HP Remote Graphics Software (RGS)  RealVNC  TurboVNC and VirtualGL  Nomachine NX Coupled with these third party remote visualization engines (which specialize in delivering high frame-rates for 3D graphics), the NICE offering for Remote Visualization solves the issues of user authentication, dynamic session allocation, session man- agement and data transfers. End users can enjoy the following improvements: ʎʎ Intuitive, application-centric Web interface to start, control and re-connect to a session ʎʎ Single sign-on for batch and interactive applications ʎʎ All data transfers from and to the remote visualization farm are handled by EnginFrame ʎʎ Built-in collaboration, to share sessions with other users ʎʎ The load and usage of the visualization cluster is monitored in the browser The solution also delivers significant added-value for the sys- tem administrators: ʎʎ No need of SSH / SCP / FTP on the client machine ʎʎ Easy integration into identity services, Single Sign-On (SSO), Enterprise portals ʎʎ Automated data life cycle management ʎʎ Built-in user session sharing, to facilitate support ʎʎ Interactive sessions are load balanced by a scheduler (LSF, SGE or Torque) to achieve optimal performance and resource usage ʎʎ Better control and use of application licenses ʎʎ Monitor, control and manage users’ idle sessions transtec has a long-term experience in Engineering environ- ments, especially from the CAD/CAE sector. This allow us to provide customers from this area with solutions that greatly en- hance their workflow and minimizes time-to-result. This, together with transtec’s offerings of all kinds of services, al- lows our customers to fully focus on their productive work, and have us do the environmental optimizations.
    • 62 ʎʎ Supports multiple user collaboration via session sharing ʎʎ Enables attractive Return-on-Investment through resource sharing and consolidation to data centers (GPU, memory, CPU, ...) ʎʎ Keeps the data secure in the data center, reducing data load and save time ʎʎ Enables right sizing of system allocation based on user’s dy- namic needs ʎʎ Facilitates application deployment: all applications, updates and patches are instantly available to everyone, without any changes to original code Business Benefits The business benefits for adopting Nice DCV can be summarized in to four categories: Category Business Benefits Productivity ʎʎ Increase business efficiency ʎʎ Improve team performance by ensuring real-time collaboration with colleagues and partners in real time, anywhere. ʎʎ Reduce IT management costs by consolidating workstation resources to a single point-of- management ʎʎ Save money and time on applica- tion deployment ʎʎ Let users work from anywhere there is an Internet connection NICEEnginFrame Desktop Cloud Virtualization
    • 63 Business Continuity ʎʎ Move graphics processing and data to the datacenter - not on laptop/desktop ʎʎ Cloud-based platform support enables you to scale the visua- lization solution „on-demand“ to extend business, grow new revenue, manage costs. Data Security ʎʎ Guarantee secure and auditable use of remote resources (appli- cations, data, infrastructure, licenses) ʎʎ Allow real-time collaboration with partners while protecting In- tellectual Property and resources ʎʎ Restrict access by class of user, service, application, and resource Training Effectiveness ʎʎ Enable multiple users to follow application procedures alongside an instructor in real-time ʎʎ Enable collaboration and session sharing among remote users (em- ployees, partners, and affiliates) Nice DCV is perfectly integrated into EnginFrame Views, lever- aging 2D/3D capabilities over the Web, including the ability to share an interactive session with other users for collaborative working. Windows Linux ʎʎ Microsoft Windows 7 - 32/64 bit ʎʎ Microsoft Windows Vista - 32/64 bit ʎʎ Microsoft Windows XP - 32/64 bit ʎʎ RedHat Enterprise 4, 5, 5.5, 6 - 32/64 bit ʎʎ SUSE Enterprise Server 11 - 32/64 bit © 2012 by NICE
    • IntelClusterReady – AQualityStandardfor HPCClusters
    • Intel Cluster Ready is de- signed to create predict- able expectations for users and providers of HPC clus- ters, primarily targeting customers in the commer- cial and industrial sectors. These are not experimental “test-bed” clusters used for computer science and com- puter engineering research, or high-end “capability” clusters closely targeting their specific computing re- quirements that power the high-energy physics at the national labs or other spe- cialized research organiza- tions. Intel Cluster Ready seeks to advance HPC clusters used as computing resources in production environments by providing cluster owners with a high degree of confi- dence that the clusters they deploy will run the applica- tions their scientific and en- gineering staff rely upon to do their jobs. It achieves this by providing cluster hard- ware, software, and system providers with a precisely defined basis for their prod- ucts to meet their customers’ production cluster require- ments. 65
    • 66 IntelClusterReady A Quality Standard for HPC Clusters What are the Objectives of ICR? The primary objective of Intel Cluster Ready is to make clusters easier to specify, easier to buy, easier to deploy, and make it easi- er to develop applications that run on them. A key feature of ICR is the concept of “application mobility”, which is defined as the ability of a registered Intel Cluster Ready application – more cor- rectly, the same binary – to run correctly on any certified Intel Cluster Ready cluster. Clearly, application mobility is important for users, software providers, hardware providers, and system providers. ʎʎ Users want to know the cluster they choose will reliably run the applications they rely on today, and will rely on tomorrow ʎʎ Application providers want to satisfy the needs of their cus- tomers by providing applications that reliably run on their customers’ cluster hardware and cluster stacks ʎʎ Cluster stack providers want to satisfy the needs of their cus- tomers by providing a cluster stack that supports their custo- mers’ applications and cluster hardware ʎʎ Hardware providers want to satisfy the needs of their custo- mers by providing hardware components that supports their customers’ applications and cluster stacks ʎʎ System providers want to satisfy the needs of their customers by providing complete cluster implementations that reliably run their customers’ applications Without application mobility, each group above must either try to support all combinations, which they have neither the time nor resources to do, or pick the “winning combination(s)” that best supports their needs, and risk making the wrong choice. The Intel Cluster Ready definition of application portability sup- ports all of these needs by going beyond pure portability, (re- compiling and linking a unique binary for each platform), to ap- plication binary mobility, (running the same binary on multiple platforms), by more precisely defining the target system. “The Intel Cluster Checker allows us to certify that our transtec HPC clusters are compliant with an independent high quality standard. Our customers can rest assured: their applications run as they ex- pect.” Marcus Wiedemann HPC Solution Engineer
    • 67 A further aspect of application mobility is to ensure that regis- tered Intel Cluster Ready applications do not need special pro- gramming or alternate binaries for different message fabrics. Intel Cluster Ready accomplishes this by providing an MPI imple- mentation supporting multiple fabrics at runtime; through this, registered Intel Cluster Ready applications obey the “message layer independence property”. Stepping back, the unifying con- cept of Intel Cluster Ready is “one-to-many,” that is  One application will run on many clusters  One cluster will run many applications How is one-to-many accomplished? Looking at Figure 1, you see the abstract Intel Cluster Ready “stack” components that always exist in every cluster, i.e., one or more applications, a cluster soft- ware stack, one or more fabrics, and finally the underlying clus- ter hardware. The remainder of that picture (to the right) shows the components in greater detail. Applications, on the top of the stack, rely upon the various APIs, utilities, and file system structure presented by the underly- ing software stack. Registered Intel Cluster Ready applications are always able to rely upon the APIs, utilities, and file system structure specified by the Intel Cluster Ready Specification; if an application requires software outside this “required” set, then Intel Cluster Ready requires the application to provide that soft- ware as a part of its installation. To ensure that this additional per-application software doesn’t conflict with the cluster stack or other applications, Intel Cluster Ready also requires the addi- tional software to be installed in application-private trees, so the application knows how to find that software while not interfer- ing with other applications. While this may well cause duplicate software to be installed, the reliability provided by the duplica- tion far outweighs the cost of the duplicated files. A prime ex- ample supporting this comparison is the removal of a common file (library, utility, or other) that is unknowingly needed by some ICR Stack SoftwareStack Fabric System CFD Registered Applications A Single Solution Platform Fabrics Certified Cluster Platforms Crash Climate QCD Intel MPI Library (run-time) Intel MKL Cluster Edition (run-time) Linux Cluster Tools (Intel Selected) InfiniBand (OFED) Intel Xeon Processor Platform Gigabit Ethernet 10Gbit Ethernet Bio Optional Value Add by Individual Platform Integrator Intel Development Tools (C++, Intel Trace Analyzer and Collector, MKL, etc.) ... Intel OEM1 OEM2 PI1 PI2 ...
    • 68 other application – such errors can be insidious to repair even when they cause an outright application failure. Cluster platforms, at the bottom of the stack, provide the APIs, utilities, and file system structure relied upon by registered ap- plications. Certified Intel Cluster Ready platforms ensure the APIs, utilities, and file system structure are complete per the Intel Cluster Ready Specification; certified clusters are able to provide them by various means as they deem appropriate. Because of the clearly defined responsibilities ensuring the presence of all software required by registered applications, system providers have a high confidence that the certified clusters they build are able to run any certified applications their customers rely on. In addition to meeting the Intel Cluster Ready requirements, certi- fied clusters can also provide their added value, that is, other fea- tures and capabilities that increase the value of their products. How Does Intel Cluster Ready Accomplish its Objectives? At its heart, Intel Cluster Ready is a definition of the cluster as a parallel application platform, as well as a tool to certify an actual cluster to the definition. Let’s look at each of these in more de- tail, to understand their motivations and benefits. A definition of the cluster as parallel application platform The Intel Cluster Ready Specification is very much written as the requirements for, not the implementation of, a platform upon which parallel applications, more specifically MPI appli- cations, can be built and run. As such, the specification doesn’t care whether the cluster is diskful or diskless, fully distributed or single system image (SSI), built from “Enterprise” distributions or community distributions, fully open source or not. Perhaps more importantly, with one exception, the specification doesn’t have any requirements on how the cluster is built; that one exception is that compute nodes must be built with automated tools, so that new, repaired, or replaced nodes can be rebuilt identically IntelClusterReady A Quality Standard for HPC Clusters
    • 69 to the existing nodes without any manual interaction, other than possibly initiating the build process. Some items the specification does care about include: ʎʎ The ability to run both 32- and 64-bit applications, including MPI applications and X-clients, on any of the compute nodes ʎʎ Consistency among the compute nodes’ configuration, capa- bility, and performance ʎʎ The identical accessibility of libraries and tools across the cluster ʎʎ The identical access by each compute node to permanent and temporary storage, as well as users’ data ʎʎ The identical access to each compute node from the head node ʎʎ The MPI implementation provides fabric independence ʎʎ All nodes support network booting and provide a remotely accessible console The specification also requires that the runtimes for specific Intel software products are installed on every certified cluster:  Intel Math Kernel Library  Intel MPI Library Runtime Environment  Intel Threading Building Blocks This requirement does two things. First and foremost, main- line Linux distributions do not necessarily provide a sufficient software stack to build an HPC cluster – such specialization is beyond their mission. Secondly, the requirement ensures that programs built with this software will always work on certified clusters and enjoy simpler installations. As these runtimes are directly available from the web, the requirement does not cause additional costs to certified clusters. It is also very important to note that this does not require certified applications to use these libraries nor does it preclude alternate libraries, e.g., other MPI implementations, from being present on certified clusters. Quite clearly, an application that requires, e.g., an alternate MPI, must also provide the runtimes for that MPI as a part of its instal- lation. A tool to certify an actual cluster to the definition The Intel Cluster Checker, included with every certified Intel Clus- ter Ready implementation, is used in four modes in the life of a cluster: ʎʎ To certify a system provider’s prototype cluster as a valid im- plementation of the specification ʎʎ To verify to the owner that the just-delivered cluster is a “true copy” of the certified prototype ʎʎ To ensure the cluster remains fully functional, reducing ser- vice calls not related to the applications or the hardware ʎʎ To help software and system providers diagnose and correct actual problems to their code or their hardware. While these are critical capabilities, in all fairness, this greatly understates the capabilities of Intel Cluster Checker. The tool will not only verify the cluster is performing as expected. To do this, per-node and cluster-wide static and dynamic tests are made of the hardware and software. Intel Cluster Checker Cluster Definition & Configuration XML File STDOUT + Logfile Pass/Fail Results & Diagnostics Cluster Checker Engine Output Test Module Parallel Ops Check Result Config Output Result API Test Module Parallel Ops Check Config Node Node Node Node Node Node Node Node
    • 70 The static checks ensure the systems are configured consistent- ly and appropriately. As one example, the tool will ensure the systems are all running the same BIOS versions as well as hav- ing identical configurations among key BIOS settings. This type of problem – differing BIOS versions or settings – can be the root cause of subtle problems such as differing memory configura- tions that manifest themselves as differing memory bandwidths only to be seen at the application level as slower than expected overall performance. As is well-known, parallel program perfor- mance can be very much governed by the performance of the slowest components, not the fastest. In another static check, the Intel Cluster Checker will ensure that the expected tools, librar- ies, and files are present on each node, identically located on all nodes, as well as identically implemented on all nodes. This en- sures that each node has the minimal software stack specified by the specification, as well as identical software stack among the compute nodes. A typical dynamic check ensures consistent system perfor- mance, e.g., via the STREAM benchmark. This particular test en- sures processor and memory performance is consistent across compute nodes, which, like the BIOS setting example above, can be the root cause of overall slower application performance. An additional check with STREAM can be made if the user config- ures an expectation of benchmark performance; this check will ensure that performance is not only consistent across the clus- ter, but also meets expectations. Going beyond processor per- formance, the Intel MPI Benchmarks are used to ensure the net- work fabric(s) are performing properly and, with a configuration that describes expected performance levels, up to the cluster provider’s performance expectations. Network inconsistencies due to poorly performing Ethernet NICs, InfiniBand HBAs, faulty switches, and loose or faulty cables can be identified. Finally, the Intel Cluster Checker is extensible, enabling additional tests to be added supporting additional features and capabilities. This enables the Intel Cluster Checker to not only support the mini- IntelClusterReady Intel Cluster Ready Builds HPC Momentum
    • 71 Intel Cluster Ready Builds HPC Momentum With the Intel Cluster Ready (ICR) program, Intel Corporation set out to create a win-win scenario for the major constituencies in the high-performance computing (HPC) cluster market. Hard- ware vendors and independent software vendors (ISVs) stand to win by being able to ensure both buyers and users that their products will work well together straight out of the box. System administrators stand to win by being able to meet corporate de- mands to push HPC competitive advantages deeper into their organizations while satisfying end users’ demands for reliable HPC cycles, all without increasing IT staff. End users stand to win by being able to get their work done faster, with less downtime, on certified cluster platforms. Last but not least, with ICR, Intel has positioned itself to win by expanding the total addressable market (TAM) and reducing time to market for the company’s mi- croprocessors, chip sets, and platforms. The Worst of Times For a number of years, clusters were largely confined to govern- ment and academic sites, where contingents of graduate stu- dents and midlevel employees were available to help program and maintain the unwieldy early systems. Commercial firms lacked this low-cost labor supply and mistrusted the favored cluster operating system, open source Linux, on the grounds that no single party could be held accountable if something went wrong with it. Today, cluster penetration in the HPC mar- ket is deep and wide, extending from systems with a handful of processors to some of the world’s largest supercomputers, and from under $25,000 to tens or hundreds of millions of dollars in price. Clusters increasingly pervade every HPC vertical market: biosciences, computer-aided engineering, chemical engineering, digital content creation, economic/financial services, electronic design automation, geosciences/geo-engineering, mechanical mal requirements of the Intel Cluster Ready Specification, but the full cluster as delivered to the customer. Conforming hardware and software The preceding was primarily related to the builders of certified clusters and the developers of registered applications. For end users that want to purchase a certified cluster to run registered applications, the ability to identify registered applications and certified clusters is most important, as that will reduce their ef- fort to evaluate, acquire, and deploy the clusters that run their applications, and then keep that computing resource operating properly, with full performance, directly increasing their produc- tivity. Intel, XEON and certain other trademarks and logos appearing in this brochure, are trademarks or registered trademarks of Intel Corporation. Cluster Platform Software Tools Reference Designs Demand Creation Support & Training Specification Configuration Software Interconnect ServerPlatform IntelProcessor ISV Enabling © Intel Cluster Ready Program
    • 72 design, defense, government labs, academia, and weather/cli- mate. But IDC studies have consistently shown that clusters remain difficult to specify, deploy, and manage, especially for new and less experienced HPC users. This should come as no surprise, given that a cluster is a set of independent computers linked to- gether by software and networking technologies from multiple vendors. Clusters originated as do-it-yourself HPC systems. In the late 1990s users began employing inexpensive hardware to cobble together scientific computing systems based on the “Beowulf cluster” concept first developed by Thomas Sterling and Don- ald Becker at NASA. From their Beowulf origins, clusters have evolved and matured substantially, but the system manage- ment issues that plagued their early years remain in force to- day. The Need for Standard Cluster Solutions The escalating complexity of HPC clusters poses a dilemma for many large IT departments that cannot afford to scale up their HPC-knowledgeable staff to meet the fast-growing end-user de- mand for technical computing resources. Cluster management is even more problematic for smaller organizations and business units that often have no dedicated, HPC-knowledgeable staff to begin with. The ICR program aims to address burgeoning cluster complex- ity by making available a standard solution (aka reference archi- tecture) for Intel-based systems that hardware vendors can use to certify their configurations and that ISVs and other software vendors can use to test and register their applications, system IntelClusterReady Intel Cluster Ready Builds HPC Momentum
    • 73 software, and HPC management software. The chief goal of this voluntary compliance program is to ensure fundamental hard- ware-software integration and interoperability so that system administrators and end users can confidently purchase and de- ploy HPC clusters, and get their work done, even in cases where no HPC-knowledgeable staff are available to help. The ICR program wants to prevent end users from having to become, in effect, their own systems integrators. In smaller organizations, the ICR program is designed to allow over- worked IT departments with limited or no HPC expertise to support HPC user requirements more readily. For larger or- ganizations with dedicated HPC staff, ICR creates confidence that required user applications will work, eases the problem of system administration, and allows HPC cluster systems to be scaled up in size without scaling support staff. ICR can help drive HPC cluster resources deeper into larger organizations and free up IT staff to focus on mainstream enterprise appli- cations (e.g., payroll, sales, HR, and CRM). The program is a three-way collaboration among hardware ven- dors, software vendors, and Intel. In this triple alliance, Intel pro- vides the specification for the cluster architecture implementa- tion, and then vendors certify the hardware configurations and register software applications as compliant with the specifica- tion. The ICR program’s promise to system administrators and end users is that registered applications will run out of the box on certified hardware configurations. ICR solutions are compliant with the standard platform architec- ture, which starts with 64-bit Intel Xeon processors in an Intel- certified cluster hardware platform. Layered on top of this foun- dation are the interconnect fabric (Gigabit Ethernet, InfiniBand) and the software stack: Intel-selected Linux cluster tools, an Intel MPI runtime library, and the Intel Math Kernel Library. In- tel runtime components are available and verified as part of the certification (e.g., Intel tool runtimes) but are not required to be used by applications. The inclusion of these Intel runtime components does not exclude any other components a systems vendor or ISV might want to use. At the top of the stack are Intel- registered ISV applications. At the heart of the program is the Intel Cluster Checker, a vali- dation tool that verifies that a cluster is specification compliant and operational before ISV applications are ever loaded. After the cluster is up and running, the Cluster Checker can function as a fault isolation tool in wellness mode. Certification needs to happen only once for each distinct hardware platform, while verification – which determines whether a valid copy of the specification is operating – can be performed by the Cluster Checker at any time. Cluster Checker is an evolving tool that is designed to accept new test modules. It is a productized tool that ICR members ship with their systems. Cluster Checker originally was designed for homogeneous clusters but can now also be applied to clusters with specialized nodes, such as all-storage sub-clusters. Cluster Checker can isolate a wide range of problems, including network or communication problems.
    • transtec offers their customers a new and fascinating way to evaluate transtec’s HPC solutions in real world scenarios. With the transtec Benchmarking Center solutions can be explored in detail with the actual applications the customers will later run on them. Intel Cluster Ready makes this feasible by simplifying the maintenance of the systems and set-up of clean systems very eas- ily, and as often as needed. As High-Performance Computing (HPC) systems are utilized for numerical simulations, more and more advanced clustering technologies are being deployed. Because of its performance, price/performance and energy efficiency ad- vantages, clusters now dominate all segments of the HPC market and continue to gain acceptance. HPC computer systems have be- comefarmorewidespreadandpervasiveingovernment,industry, and academia. However, rarely does the client have the possibility to test their actual application on the system they are planning to acquire. The transtec Benchmarking Center transtec HPC solutions get used by a wide variety of clients. Among those are most of the large users of compute power at German and other European universities and research centers as well as governmental users like the German army’s compute 74 IntelClusterReady The transtec Benchmarking Center
    • center and clients from the high tech, the automotive and other sectors. transtec HPC solutions have demonstrated their value in more than 500 installations. Most of transtec’s cluster systems are based on SUSE Linux Enterprise Server, Red Hat Enterprise Linux, CentOS, or Scientific Linux. With xCAT for efficient cluster deploy- ment, and Moab HPC Suite by Adaptive Computing for high-level cluster management, transtec is able to efficiently deploy and ship easy-to-use HPC cluster solutions with enterprise-class management features. Moab has proven to provide easy-to-use workload and job management for small systems as well as the largest cluster installations worldwide. However, when selling clusters to governmental customers as well as other large en- terprises, it is often required that the client can choose from a range of competing offers. Many times there is a fixed budget available and competing solutions are compared based on their performance towards certain custom benchmark codes. So, in 2007 transtec decided to add another layer to their already wide array of competence in HPC – ranging from cluster deploy- ment and management, the latest CPU, board and network tech- nology to HPC storage systems. In transtec’s HPC Lab the systems are being assembled. transtec is using Intel Cluster Ready to fa- cilitate testing, verification, documentation, and final testing throughout the actual build process. At the benchmarking center transtec can now offer a set of small clusters with the “newest and hottest technology” through Intel Cluster Ready. A standard installation infrastructure gives transtec a quick and easy way to set systems up according to their customers’ choice of operating system, compilers, workload management suite, and so on. With Intel Cluster Ready there are prepared standard set-ups available with verified performance at standard benchmarks while the system stability is guaranteed by our own test suite and the Intel Cluster Checker. The Intel Cluster Ready program is designed to provide a com- mon standard for HPC clusters, helping organizations design and build seamless, compatible and consistent cluster configu- rations. Integrating the standards and tools provided by this pro- gram can help significantly simplify the deployment and man- agement of HPC clusters. 75
    • ParallelNFS– TheNewStandard forHPCStorage
    • 77 HPC computation results in the terabyte range are not uncommon. The problem in this context is not so much storing the data at rest, but the performance of the necessary copying back and forth in the course of the computation job flow and the dependent job turn- around time. For interim results during a job runtime or for fast stor- ageofinputandresultsdata, parallel file systems have established themselves as the standard to meet the ev- er-increasing performance requirements of HPC stor- age systems. Parallel NFS is about to become the new standard framework for a parallel file system.
    • 78 ParallelNFS The New Standard for HPC Storage Yesterday’s Solution: NFS for HPC Storage The original Network File System (NFS) developed by Sun Micro- systems at the end of the eighties – now available in version 4.1 – has been established for a long time as a de-facto standard for theprovisioningofaglobalnamespaceinnetworkedcomputing. A very widespread HPC cluster solution includes a central mater node acting simultaneously as an NFS server, with its local file system storing input, interim and results data and exporting them to all other cluster nodes. There is of course an immediate bottleneck in this method: When the load of the network is high, or where there are large num- bers of nodes, the NFS server can no longer keep up delivering or receiving the data. In high-performance computing especially, the nodes are interconnected at least once via Gigabit Ethernet, so the sum total throughput is well above what an NFS server with a Gigabit interface can achieve. Even a powerful network connection of the NFS server to the cluster, for example with 10-Gigabit Ethernet, is only a temporary solution to this prob- lem until the next cluster upgrade. The fundamental problem re- mains – this solution is not scalable; in addition, NFS is a difficult protocol to cluster in terms of load balancing: either you have to ensure that multiple NFS servers accessing the same data A classical NFS server is a bottleneck Cluster Nodes = NFS Clients NAS Head = NFS Server
    • 79 are constantly synchronised, the disadvantage being a notice- able drop in performance or you manually partition the global namespace which is also time-consuming. NFS is not suitable for dynamic load balancing as on paper it appears to be stateless but in reality is, in fact, stateful. Today’s Solution: Parallel File Systems For some time, powerful commercial products have been avail- able to meet the high demands on an HPC storage system. The open-source solutions FraunhoferFS (FhGFS) from the Fraun- hofer Competence Center for High Performance Computing or Lustre are widely used in the Linux HPC world, and also several other free as well as commercial parallel file system solutions exist. What is new is that the time-honoured NFS is to be upgraded, including a parallel version, into an Internet Standard with the aim of interoperability between all operating systems. The original problem statement for parallel NFS access was written by Garth Gibson, a professor at Carnegie Mellon University and founder and CTO of Panasas. Gibson was already a renowned figure being one of the authors contributing to the original paper on RAID architecture from 1988. The original statement from Gibson and Panasas is clearly noticeable in the design of pNFS. The powerful HPC file system developed by Gibson and Panasas, ActiveScale PanFS, with object-based storage devices functioning as central components, is basically the commercial continuation of the “Network-Attached Secure Disk (NASD)” project also developed by Garth Gibson at the Carnegie Mellon University. Parallel NFS Parallel NFS (pNFS) is gradually emerging as the future standard to meet requirements in the HPC environment. From the indus- try’s as well as the user’s perspective, the benefits of utilising standard solutions are indisputable: besides protecting end user investment, standards also ensure a defined level of interoper- ability without restricting the choice of products available. As a result, less user and administrator training is required which leads to simpler deployment and at the same time, a greater ac- ceptance. As part of the NFS 4.1 Internet Standard, pNFS will not only adopt the semantics of NFS in terms of cache consistency or security, it also represents an easy and flexible extension of the NFS 4 protocol. pNFS is optional, in other words, NFS 4.1 implementations do not have to include pNFS as a feature. The scheduled Internet Standard NFS 4.1 is today presented as IETF RFC 5661. The pNFS protocol supports a separation of metadata and data: a pNFS cluster comprises so-called storage devices which store the data from the shared file system and a metadata server (MDS), called Director Blade with Panasas – the actual NFS 4.1 server. The metadata server keeps track of which data is stored on which storage devices and how to access the files, the so- called layout. Besides these “striping parameters”, the MDS also manages other metadata including access rights or similar, which is usually stored in a file’s inode. The layout types define which Storage Access Protocol is used by the clients to access the storage devices. Up until now, three potential storage access protocols have been defined for pNFS: file, block and object-based layouts, the former be- ing described in RFC 5661 direct, the latters in RFC 5663 and 5664, respectively. Last but not least, a Control Protocol is also used by the MDS and storage devices to synchronise status data. This protocol is deliberately unspecified in the standard to give manufacturers certain flexibility. The NFS 4.1 standard does however specify certain conditions which a control pro- tocol has to fulfil, for example, how to deal with the change/ modify time attributes of files.
    • 80 What’s New in NFS 4.1? NFS 4.1 is a minor update to NFS 4, and adds new features to it. One of the optional features is parallel NFS (pNFS) but there is other new functionality as well. One of the technical enhancements is the use of sessions, a persistent server object, dynamically created by the client. By means of sessions, the state of an NFS connection can be stored, no matter whether the connection is live or not. Sessions survive temporary downtimes both of the client and the server. Each session has a so-called fore channel, which is the connec- tion connection from the client to the server for all RPC opera- tions, and optionally a back channel for RPC callbacks from the server that now can also be realized through firewall boundar- ies. Sessions can be trunked to increase the bandwidth. Besides session trunking there is also a client ID trunking for grouping together several sessions to the same client ID. By means of sessions, NFS can be seen as a really stateful proto- col with a so-called “Exactly-Once Semantics (EOS)”. Until now, a necessary but unspecified reply cache within the NFS server is implemented to handle identical RPC operations that have been sent several times. This statefulness in reality is not very robust, however, and sometimes leads to the well-known stale NFS han- dles. In NFS 4.1, the reply cache is now a mandatory part of the NFS implementation, storing the server replies to RPC requests persistently on disk. Another new feature of NFS 4.1 is delegation for directories: NFS clients can be given temporary exclusive access to directories. Before, this has only been possible for simple files. With the forthcoming version 4.2 of the NFS standard, federated filesys- tems will be added as a feature, which represents the NFS coun- terpart of Microsoft’s DFS (distributed filesystem). ParallelNFS What’s New in NFS 4.1?
    • 81 pNFS supports backwards compatibility with non-pNFS com- patible NFS 4 clients. In this case, the MDS itself gathers data from the storage devices on behalf of the NFS client and pres- ents the data to the NFS client via NFS 4. The MDS acts as a kind of proxy server – which is e.g. what the Director Blades from Panasas do. pNFS Layout Types If storage devices act simply as NFS 4 file servers, the file layout is used. It is the only storage access protocol directly specified in the NFS 4.1 standard. Besides the stripe sizes and stripe locations (storage devices), it also includes the NFS file handles which the client needs to use to access the separate file areas. The file layout is compact and static, the striping information does not change even if changes are made to the file enabling multiple pNFS clients to simultaneously cache the layout and avoid synchronisation overhead between cli- ents and the MDS or the MDS and storage devices. File system authorisation and client authentication can be well implemented with the file layout. When using NFS 4 as the storage access protocol, client authentication merely de- pends on the security flavor used – when using the RPCSEC_ GSS security flavor, client access is kerberized, for example and the server controls access authorization using specified ACLs and cryptographic processes. In contrast, the block/volume layout uses volume identifiers and block offsets and extents to specify a file layout. SCSI block commands are used to access storage devices. As the block distribution can change with each write access, the lay- out must be updated more frequently than with the file lay- out. Block-based access to storage devices does not offer any se- cure authentication option for the accessing SCSI initiator. Secure SAN authorisation is possible with host granularity only, based on World Wide Names (WWNs) with Fibre Channel or Initiator Node Names (IQNs) with iSCSI. The server cannot enforce access control governed by the file system. On the contrary, a pNFS client basically voluntarily abides with the access rights, the storage device has to trust the pNFS client – a fundamental access control problem that is a recurrent is- sue in the NFS protocol history. The object layout is syntactically similar to the file layout, but it uses the SCSI object command set for data access to so-called Object-based Storage Devices (OSDs) and is heavily based on the DirectFLOW protocol of the ActiveScale PanFS from Panasas. From the very start, Object-Based Storage De- vices were designed for secure authentication and access. So- called capabilities are used for object access which involves the MDS issuing so-called capabilities to the pNFS clients. The ownership of these capabilities represents the authoritative access right to an object. pNFS can be upgraded to integrate other storage access pro- tocols and operating systems, and storage manufacturers also have the option to ship additional layout drivers for their pNFS implementations. Parallel NFS pNFS Clients Storage Device Metadata Server Storage Access Protocol NFS 4.1 Control Protocol
    • 82 ParallelNFS Panasas HPC Storage Having a many years’ experience in deploying parallel file sys- tems like Lustre or FraunhoferFS (FhGFS) from smaller scales up to hundreds of terabytes capacity and throughputs of several gigabytes per second, transtec chose Panasas, the leader in HPC storage solutions, as the partner for providing highest perfor- mance and scalability on the one hand, and ease of manage- ment on the other. Therefore, with Panasas as the technology leader, and transtec’s overall experience and customer-oriented approach, customers can be assured to get the best possible HPC storage solution available.
    • 83 The Panasas file system uses parallel and redundant access to object storage devices (OSDs), per-file RAID, distributed metada- ta management, consistent client caching, file locking services, and internal cluster management to provide a scalable, fault tolerant, high performance distributed file system. The clustered design of the storage system and the use of client-driven RAID provide scalable performance to many concurrent file system clients through parallel access to file data that is striped across OSD storage nodes. RAID recovery is performed in parallel by the cluster of metadata managers, and declustered data placement yields scalable RAID rebuild rates as the storage system grows larger. Introduction Storage systems for high performance computing environments must be designed to scale in performance so that they can be configured to match the required load. Clustering techniques are often used to provide scalability. In a storage cluster, many nodes each control some storage, and the overall distributed file system assembles the cluster elements into one large, seamless storage system. The storage cluster can be hosted on the same computers that perform data processing, or they can be a sepa- rate cluster that is devoted entirely to storage and accessible to the compute cluster via a network protocol. The Panasas storage system is a specialized storage cluster, and this paper presents its design and a number of performance measurements to illustrate the scalability. The Panasas system is a production system that provides file service to some of the largest compute clusters in the world, in scientific labs, in seis- mic data processing, in digital animation studios, in computa- tional fluid dynamics, in semiconductor manufacturing, and in general purpose computing environments. In these environ- ments, hundreds or thousands of file system clients share data and generate very high aggregate I/O load on the file system. The Panasas system is designed to support several thousand clients and storage capacities in excess of a petabyte. The unique aspects of the Panasas system are its use of per-file, client-driven RAID, its parallel RAID rebuild, its treatment of dif- ferent classes of metadata (block, file, system) and a commodity parts based blade hardware with integrated UPS. Of course, the system has many other features (such as object storage, fault tol- erance, caching and cache consistency, and a simplified manage- ment model) that are not unique, but are necessary for a scalable system implementation. Panasas File System Background The two overall themes to the system are object storage, which affects how the file system manages its data, and clustering of components, which allows the system to scale in performance and capacity. Object Storage Anobjectisacontainerfordataandattributes;itisanalogoustothe inode inside a traditional UNIX file system implementation. Special- izedstoragenodescalledObjectStorageDevices(OSD)storeobjects in a local OSDFS file system. The object interface addresses objects in a two-level (partition ID/object ID) namespace. The OSD wire pro- tocol provides byteoriented access to the data, attribute manipula- tion, creation and deletion of objects, and several other specialized operations.PanasasusesaniSCSItransporttocarryOSDcommands that are very similar to the OSDv2 standard currently in progress within SNIA and ANSI-T10. The Panasas file system is layered over the object storage. Each file is striped over two or more objects to provide redundancy and high bandwidth access. The file system semantics are implemented by metadata managers that mediate access to objects from clients of the file system. The clients access the object storage using the iSCSI/OSD protocol for Read and Write operations. The I/O op- erations proceed directly and in parallel to the storage nodes, bypassing the metadata managers. The clients interact with the out-of-band metadata managers via RPC to obtain access capa- bilities and location information for the objects that store files.
    • 84 Object attributes are used to store file-level attributes, and direc- tories are implemented with objects that store name to object ID mappings. Thus the file system metadata is kept in the object store itself, rather than being kept in a separate database or some other form of storage on the metadata nodes. System Software Components The major software subsystems are the OSDFS object storage system, the Panasas file system metadata manager, the Panasas file system client, the NFS/CIFS gateway, and the overall cluster management system. ʎʎ ThePanasasclientisaninstallablekernelmodulethatrunsinside the Linux kernel. The kernel module implements the standard VFS interface, so that the client hosts can mount the file system ParallelNFS Panasas HPC Storage NFS/CIFS OSDFS SysMgr COMPUTE NODE MANAGER NODE STORAGE NODE iSCSI/ OSD PanFSClient RPC Client Panasas System Components
    • 85 and use a POSIX interface to the storage system. ʎʎ Each storage cluster node runs a common platform that is based on FreeBSD, with additional services to provide hardware moni- toring, configuration management, and overall control. ʎʎ The storage nodes use a specialized local file system (OSDFS) that implements the object storage primitives. They imple- ment an iSCSI target and the OSD command set. The OSDFS object store and iSCSI target/OSD command processor are kernel modules. OSDFS is concerned with traditional block- level file system issues such as efficient disk arm utilization, media management (i.e., error handling), high throughput, as well as the OSD interface. ʎʎ Theclustermanager(SysMgr)maintainstheglobalconfiguration, and it controls the other services and nodes in the storage clus- ter. There is an associated management application that provi- des both a command line interface (CLI) and an HTML interface (GUI). These are all user level applications that run on a subset of the manager nodes. The cluster manager is concerned with membershipinthestoragecluster,faultdetection,configuration management,andoverallcontrolforoperationslikesoftwareup- grade and system restart. ʎʎ The Panasas metadata manager (PanFS) implements the file system semantics and manages data striping across the object storage devices. This is a user level application that runs on every manager node. The metadata manager is concerned with distri- buted file system issues such as secure multi-user access, main- taining consistent file- and object-level metadata, client cache coherency,andrecoveryfromclient,storagenode,andmetadata server crashes. Fault tolerance is based on a local transaction log that is replicated to a backup on a different manager node. ʎʎ The NFS and CIFS services provide access to the file system for hosts that cannot use our Linux installable file system client. The NFS service is a tuned version of the standard FreeBSD NFS server that runs inside the kernel. The CIFS ser- vice is based on Samba and runs at user level. In turn, these services use a local instance of the file system client, which runs inside the FreeBSD kernel. These gateway services run on every manager node to provide a clustered NFS and CIFS service. Commodity Hardware Platform The storage cluster nodes are implemented as blades that are very compact computer systems made from commodity parts. The blades are clustered together to provide a scalable platform. The OSD StorageBlade module and metadata manager Director- Blade module use the same form factor blade and fit into the same chassis slots. Storage Management Traditional storage management tasks involve partitioning avail- able storage space into LUNs (i.e., logical units that are one or more disks, or a subset of a RAID array), assigning LUN ownership to dif- ferent hosts, configuring RAID parameters, creating file systems or databases on LUNs, and connecting clients to the correct server for their storage. This can be a labor-intensive scenario. Panasas pro- vides a simplified model for storage management that shields the storage administrator from these kinds of details and allow a single, part-time admin to manage systems that were hundreds of tera- bytes in size. The Panasas storage system presents itself as a file system with a POSIX interface, and hides most of the complexities of storage man- agement. Clients have a single mount point for the entire system. The/etc/fstabfilereferencestheclustermanager,andfromthatthe client learns the location of the metadata service instances. The ad- ministrator can add storage while the system is online, and new re- sources are automatically discovered. To manage available storage, Panasas introduces two basic storage concepts: a physical storage pool called a BladeSet, and a logical quota tree called a Volume. The BladeSet is a collection of StorageBlade modules in one or more shelves that comprise a RAID fault domain. Panasas mitigates the risk of large fault domains with the scalable rebuild performance described below in the text. The BladeSet is a hard physical bound-
    • 86 aryforthevolumesitcontains.ABladeSetcanbegrownatanytime, either by adding more StorageBlade modules, or by merging two ex- isting BladeSets together. The Volume is a directory hierarchy that has a quota constraint and isassignedtoaparticularBladeSet.Thequotacanbechangedatany time, and capacity is not allocated to the Volume until it is used, so multiplevolumescompeteforspacewithintheirBladeSetandgrow on demand. The files in those volumes are distributed among all the StorageBlade modules in the BladeSet. Volumes appear in the file system name space as directories. Clients have a single mount point for the whole storage sys- tem, and volumes are simply directories below the mount point. There is no need to update client mounts when the admin cre- ates, deletes, or renames volumes. Automatic Capacity Balancing Capacity imbalance occurs when expanding a BladeSet (i.e., add- ing new, empty storage nodes), merging two BladeSets, and re- placing a storage node following a failure. In the latter scenario, the imbalance is the result of the RAID rebuild, which uses spare capacity on every storage node rather than dedicating a specific “hot spare” node. This provides better throughput during rebuild, but causes the system to have a new, empty storage node after the failed storage node is replaced. The system automatically bal- ances used capacity across storage nodes in a BladeSet using two mechanisms: passive balancing and active balancing. Passive balancing changes the probability that a storage node will be used for a new component of a file, based on its available capac- ity.Thistakeseffectwhenfilesarecreated,andwhentheirstripesize isincreasedtoincludemorestoragenodes.Activebalancingisdone by moving an existing component object from one storage node to another, and updating the storage map for the affected file. During the transfer, the file is transparently marked read-only by the stor- agemanagementlayer,andthecapacitybalancerskipsfilesthatare being actively written. Capacity balancing is thus transparent to file system clients. ParallelNFS Panasas HPC Storage ©2013 Panasas Incorporated. All rights reserved. Panasas, the Panasas logo, Accelerating Time to Results, ActiveScale, Direct- FLOW, DirectorBlade, StorageBlade, PanFS, PanActive and MyPa- nasas are trademarks or registered trademarks of Panasas, Inc. in the United States and other countries. All other trademarks are the property of their respective owners. Information sup- plied by Panasas, Inc. is believed to be accurate and reliable at the time of publication, but Panasas, Inc. assumes no responsi- bility for any errors that may appear in this document. Panasas, Inc. reserves the right, without notice, to make changes in prod- uct design, specifications and prices. Information is subject to change without notice.
    • 87 Object RAID and Reconstruction Panasas protects against loss of a data object or an entire storage node by striping files across objects stored on different storage nodes, using a fault-tolerant striping algorithm such as RAID-1 or RAID-5. Small files are mirrored on two objects, and larger files are striped more widely to provide higher bandwidth and less capacity overhead from parity information. The per-file RAID layout means that parity information for different files is not mixed together, and easily allows different files to use different RAID schemes alongside each other. This property and the security mechanisms of the OSD protocol makes it possible to enforce access control over files even as clients access storage nodes directly. It also enables what is per- haps the most novel aspect of our system, client-driven RAID. That is, the clients are responsible for computing and writing parity. The OSD security mechanism also allows multiple metadata managers tomanageobjectsonthesamestoragedevicewithoutheavyweight coordinationorinterferencefromeachother. Client-driven, per-file RAID has four advantages for large-scale stor- age systems. First, by having clients compute parity for their own data,theXORpowerofthesystemscalesupasthenumberofclients increases. We measured XOR processing during streaming write bandwidth loads at 7% of the client’s CPU, with the rest going to the OSD/iSCSI/TCP/IP stack and other file system overhead. Moving XOR computationoutofthestoragesystemintotheclientrequiressome additional work to handle failures. Clients are responsible for gener- ating good data and good parity for it. Because the RAID equation is per-file, an errant client can only damage its own data. However, if a clientfailsduringawrite,themetadatamanagerwillscrubparityto ensure the parity equation is correct. The second advantage of client-driven RAID is that clients can per- form an end-to-end data integrity check. Data has to go through the disk subsystem, through the network interface on the storage nodes, through the network and routers, through the NIC on the client, and all of these transits can introduce errors with a very low probability. Clients can choose to read parity as well as data, and verify parity as part of a read operation. If errors are detected, the operationisretried.Iftheerrorispersistent,analertisraisedandthe read operation fails. By checking parity across storage nodes within the client, the system can ensure end-to-end data integrity. This is another novel property of per-file, client-driven RAID. Third, per-file RAID protection lets the metadata managers rebuild files in parallel. Although parallel rebuild is theoretically possible in block-basedRAID,itisrarelyimplemented.Thisisduetothefactthat the disks are owned by a single RAID controller, even in dual-ported configurations. Large storage systems have multiple RAID control- lers that are not interconnected. Since the SCSI Block command set does not provide fine-grained synchronization operations, it is dif- ficult for multiple RAID controllers to coordinate a complicated op- eration such as an online rebuild without external communication. Even if they could, without connectivity to the disks in the affected parity group, other RAID controllers would be unable to assist. Even in a high-availability configuration, each disk is typically only at- tached to two different RAID controllers, which limits the potential speedup to 2x. When a StorageBlade module fails, the metadata managers that own Volumes within that BladeSet determine what files are affect- ed, and then they farm out file reconstruction work to every other metadata manager in the system. Metadata managers rebuild their own files first, but if they finish early or do not own any Volumes in theaffectedBladeset,theyarefreetoaidothermetadatamanagers. Declustered parity groups spread out the I/O workload among all StorageBlade modules in the BladeSet. The result is that larger stor- age clusters reconstruct lost data more quickly. The fourth advantage of per-file RAID is that unrecoverable faults can be constrained to individual files. The most commonly en- countered double-failure scenario with RAID-5 is an unrecover- able read error (i.e., grown media defect) during the reconstruc- tion of a failed storage device. The second storage device is still healthy, but it has been unable to read a sector, which prevents rebuild of the sector lost from the first drive and potentially the entire stripe or LUN, depending on the design of the RAID con- troller. With block-based RAID, it is difficult or impossible to di-
    • 88 rectly map any lost sectors back to higher-level file system data structures, so a full file system check and media scan will be re- quired to locate and repair the damage. A more typical response is to fail the rebuild entirely. RAID controllers monitor drives in an effort to scrub out media defects and avoid this bad scenario, and the Panasas system does media scrubbing, too. However, with high capacity SATA drives, the chance of encountering a me- dia defect on drive B while rebuilding drive A is still significant. With per-file RAID-5, this sort of double failure means that only a single file is lost, and the specific file can be easily identified and reported to the administrator. While block-based RAID sys- tems have been compelled to introduce RAID-6 (i.e., fault tolerant schemes that handle two failures), the Panasas solution is able to deploy highly reliable RAID-5 systems with large, high perfor- mance storage pools. RAID Rebuild Performance RAID rebuild performance determines how quickly the system can recoverdatawhenastoragenodeislost.Shortrebuildtimesreduce thewindowinwhichasecondfailurecancausedataloss.Thereare three techniques to reduce rebuild times: reducing the size of the RAID parity group, declustering the placement of parity group ele- ments, and rebuilding files in parallel using multiple RAID engines. ParallelNFS Panasas HPC Storage Declustered parity groups b C f h J K C D e G i m A D e h i m A C D G J K b C f G i m A e f h J K b D e G i m A b f h j K
    • 89 The rebuild bandwidth is the rate at which reconstructed data is written to the system when a storage node is being reconstruct- ed. The system must read N times as much as it writes, depend- ing on the width of the RAID parity group, so the overall through- put of the storage system is several times higher than the rebuild rate. A narrower RAID parity group requires fewer read and XOR operations to rebuild, so will result in a higher rebuild band- width. However, it also results in higher capacity overhead for parity data, and can limit bandwidth during normal I/O. Thus, selection of the RAID parity group size is a trade-off between ca- pacity overhead, on-line performance, and rebuild performance. Understanding declustering is easier with a picture. In the figure on the left, each parity group has 4 elements, which are indicat- ed by letters placed in each storage device. They are distributed among 8 storage devices. The ratio between the parity group size and the available storage devices is the declustering ratio, which in this example is ½. In the picture, capital letters represent those parity groups that all share the second storage node. If the second storage device were to fail, the system would have to read the sur- viving members of its parity groups to rebuild the lost elements. You can see that the other elements of those parity groups occupy about ½ of each other storage device. For this simple example you can assume each parity element is the same size so all the devices are filled equally. In a real system, the componentobjectswillhavevarioussizesdependingontheoverall file size, although each member of a parity group will be very close in size. There will be thousands or millions of objects on each de- vice,andthePanasassystemusesactivebalancingtomovecompo- nent objects between storage nodes to level capacity. Declustering means that rebuild requires reading a subset of each device, with the proportion being approximately the same as the declustering ratio. The total amount of data read is the same with and without declustering, but with decluster- ing it is spread out over more devices. When writing the re- constructed elements, two elements of the same parity group cannot be located on the same storage node. Declustering leaves many storage devices available for the reconstructed parity element, and randomizing the placement of each file’s parity group lets the system spread out the write I/O over all the storage. Thus declustering RAID parity groups has the im- portant property of taking a fixed amount of rebuild I/O and spreading it out over more storage devices. Having per-file RAID allows the Panasas system to divide the work among the available DirectorBlade modules by assign- ing different files to different DirectorBlade modules. This di- vision is dynamic with a simple master/worker model in which metadata services make themselves available as workers, and each metadata service acts as the master for the volumes it implements. By doing rebuilds in parallel on all DirectorBlade modules, the system can apply more XOR throughput and uti- lize the additional I/O bandwidth obtained with declustering. Metadata Management There are several kinds of metadata in the Panasas system. These include the mapping from object IDs to sets of block addresses, mapping files to sets of objects, file system attributes such as ACLs and owners, file system namespace information (i.e., direc- tories), and configuration/management information about the storage cluster itself. Block-level Metadata Block-level metadata is managed internally by OSDFS, the file system that is optimized to store objects. OSDFS uses a floating block allocation scheme where data, block pointers, and object descriptors are batched into large write operations. The write buffer is protected by the integrated UPS, and it is flushed to disk on power failure or system panics. Fragmentation was an issue in early versions of OSDFS that used a first-fit block allocator, but this has been significantly mitigated in later versions that use a modified best-fit allocator. OSDFS stores higher level file system data structures, such as the partition and object tables, in a modified BTree data structure.
    • 90 Block mapping for each object uses a traditional direct/indirect/ double-indirect scheme. Free blocks are tracked by a proprietary bitmap-like data structure that is optimized for copy-on-write reference counting, part of OSDFS’s integrated support for ob- ject- and partition-level copy-on-write snapshots. Block-level metadata management consumes most of the cycles in file system implementations. By delegating storage manage- ment to OSDFS, the Panasas metadata managers have an order of magnitude less work to do than the equivalent SAN file system metadata manager that must track all the blocks in the system. File-level Metadata Above the block layer is the metadata about files. This includes user-visible information such as the owner, size, and modifica- tion time, as well as internal information that identifies which objects store the file and how the data is striped across those objects (i.e., the file’s storage map). Our system stores this file metadata in object attributes on two of the N objects used to store the file’s data. The rest of the objects have basic at- tributes like their individual length and modify times, but the higher-level file system attributes are only stored on the two attribute-storing components. File names are implemented in directories similar to traditional UNIX file systems. Directories are special files that store an ar- ray of directory entries. A directory entry identifies a file with a tuple of <serviceID, partitionID, objectID>, and also includes two <osdID> fields that are hints about the location of the at- tribute storing components. The partitionID/objectID is the two-level object numbering scheme of the OSD interface, and Panasas uses a partition for each volume. Directories are mir- rored (RAID-1) in two objects so that the small write operations associated with directory updates are efficient. Clients are allowed to read, cache and parse directories, or they can use a Lookup RPC to the metadata manager to trans- late a name to an <serviceID, partitionID, objectID> tuple and the <osdID> location hints. The serviceID provides a hint about ParallelNFS Panasas HPC Storage
    • 91 the metadata manager for the file, although clients may be re- directed to the metadata manager that currently controls the file. The osdID hint can become out-of-date if reconstruction or active balancing moves an object. If both osdID hints fail, the metadata manager has to multicast a GetAttributes to the stor- age nodes in the BladeSet to locate an object. The partitionID and objectID are the same on every storage node that stores a component of the file, so this technique will always work. Once the file is located, the metadata manager automatically updates the stored hints in the directory, allowing future ac- cesses to bypass this step. File operations may require several object operations. The figure on the left shows the steps used in creating a file. The metadata manager keeps a local journal to record in-progress actions so it can recover from object failures and metadata manager crashes that occur when updating multiple objects. For example, creat- ing a file is fairly complex task that requires updating the parent directory as well as creating the new file. There are 2 Create OSD operations to create the first two components of the file, and 2 Write OSD operations, one to each replica of the parent direc- tory. As a performance optimization, the metadata server also grants the client read and write access to the file and returns the appropriate capabilities to the client as part of the FileCre- ate results. The server makes record of these write capabilities to support error recovery if the client crashes while writing the file. Note that the directory update (step 7) occurs after the reply, so that many directory updates can be batched together. The de- ferred update is protected by the op-log record that gets deleted in step 8 after the successful directory update. The metadata manager maintains an op-log that records the object create and the directory updates that are in progress. This log entry is removed when the operation is complete. If the metadata service crashes and restarts, or a failure event moves the metadata service to a different manager node, then the op- log is processed to determine what operations were active at the time of the failure. The metadata manager rolls the opera- tions forward or backward to ensure the object store is consis- tent. If no reply to the operation has been generated, then the operation is rolled back. If a reply has been generated but pend- ing operations are outstanding (e.g., directory updates), then the operation is rolled forward. The write capability is stored in a cap-log so that when a meta- data server starts it knows which of its files are busy. In addi- tion to the “piggybacked” write capability returned by FileC- reate, the client can also execute a StartWrite RPC to obtain a separate write capability. The cap-log entry is removed when the client releases the write cap via an EndWrite RPC. If the client reports an error during its I/O, then a repair log entry is made and the file is scheduled for repair. Read and write capa- bilities are cached by the client over multiple system calls, fur- ther reducing metadata server traffic. Creating a file OSDs 1. CREATE 7. WRITE3. CREATE Metadata Server 6. REPLY Client Txn_log 2 8 4 5 oplog caplog Reply cache
    • 92 System-level Metadata The final layer of metadata is information about the overall sys- tem itself. One possibility would be to store this information in objects and bootstrap the system through a discovery protocol. The most difficult aspect of that approach is reasoning about the fault model. The system must be able to come up and be manageable while it is only partially functional. Panasas chose instead a model with a small replicated set of system managers, each that stores a replica of the system configuration metadata. Each system manager maintains a local database, outside of the object storage system. Berkeley DB is used to store tables that represent our system model. The different system manager in- stances are members of a replication set that use Lamport’s part- time parliament (PTP) protocol to make decisions and update the configuration information. Clusters are configured with one, three, or five system managers so that the voting quorum has an odd number and a network partition will cause a minority of system managers to disable themselves. System configuration state includes both static state, such as the identity of the blades in the system, as well as dynamic state such as the online/offline state of various services and error conditions associated with different system components. Each state update decision, whether it is updating the admin password or activating a service, involves a voting round and an update round according to the PTP protocol. Database updates are performed within the PTP transactions to keep the databases synchronized. Finally, the sys- tem keeps backup copies of the system configuration databases on several other blades to guard against catastrophic loss of every sys- tem manager blade. Blade configuration is pulled from the system managers as part of each blade’s startup sequence. The initial DHCP handshake conveys the addresses of the system managers, and thereafter the local OS on each blade pulls configuration information from the system managers via RPC. Theclustermanagerimplementationhastwolayers.Thelowerlevel PTP layer manages the voting rounds and ensures that partitioned ParallelNFS Panasas HPC Storage “Outstanding in the HPC world, the ActiveStor so- lutions provided by Panasas are undoubtedly the only HPC storage solutions that combine highest scalability and performance with a convincing ease of management.“ Thomas Gebert HPC Solution Architect
    • 93 or newly added system managers will be brought up-to-date with the quorum. The application layer above that uses the voting and update interface to make decisions. Complex system operations mayinvolveseveralsteps,andthesystemmanagerhastokeeptrack of its progress so it can tolerate a crash and roll back or roll forward as appropriate. Forexample,creatingavolume(i.e.,aquota-tree)involvesfilesystem operations to create a top-level directory, object operations to cre- ate an object partition within OSDFS on each StorageBlade module, service operations to activate the appropriate metadata manager, andconfigurationdatabaseoperationstoreflecttheadditionofthe volume. Recovery is enabled by having two PTP transactions. The initial PTP transaction determines if the volume should be created, and it creates a record about the volume that is marked as incom- plete. Then the system manager does all the necessary service acti- vations,fileandstorageoperations.Whentheseallcomplete,afinal PTP transaction is performed to commit the operation. If the system manager crashes before the final PTP transaction, it will detect the incomplete operation the next time it restarts, and then roll the op- eration forward or backward.
    • 95 Tandis que tous les systèmes HPC se trouvent confrontés à de grands challenges en matière de complexité, d’évolutivité et d’exigences des flux applicatifs et des ressources, les systèmes HPC des entreprises font face aux défis et attentes les plus exigeants. Les systèmes HPC d’entreprises doivent satisfaire aux exigences critiques et prioritaires des flux d’applications pour les entreprises commerciales et les organisations de recherche et académiques à orientation commerciale. Ils doi- vent trouver un équilibre entre des SLA et des priorités com- plexes. En effet, leurs flux HPC ont un impact direct sur leurs revenus, la livraison de leurs produits et leurs objectifs. 95 The CUDA parallel comput- ing platform is now widely deployed with 1000s of GPU- accelerated applications and 1000s of published re- search papers, and a com- plete range of CUDA tools and ecosystem solutions is available to developers.
    • 96 What is GPU Computing? GPU computing is the use of a GPU (graphics processing unit) together with a CPU to accelerate general-purpose scientific and engineering applications. Pioneered more five years ago by NVIDIA, GPU computing has quickly become an industry stan- dard, enjoyed by millions of users worldwide and adopted by virtually all computing vendors. GPU computing offers unprecedented application performance by offloading compute-intensive portions of the application to the GPU, while the remainder of the code still runs on the CPU. From a user’s perspective, applications simply run significantly faster. CPU + GPU is a powerful combination because CPUs consist of a few cores optimized for serial processing, while GPUs consist of thousands of smaller, more efficient cores designed for parallel performance. Serial portions of the code run on the CPU while parallel portions run on the GPU. Most customers can immediately enjoy the power of GPU com- puting by using any of the GPU-accelerated applications listed in our catalog, which highlights over one hundred, industry-lead- ing applications. For developers, GPU computing offers a vast ecosystem of tools and libraries from major software vendors “We are very proud to be one of the leading provid- ers of Tesla systems who are able to combine the overwhelming power of NVIDIA Tesla systems with the fully engineered and thoroughly tested trans- tec hardware to a total Tesla-based solution.” Norbert Zeidler Senior HPC Solution Engineer NVIDIAGPUComputing What is GPU Computing?
    • 97 History of GPU Computing Graphics chips started as fixed-function graphics processors but became increasingly programmable and computationally pow- erful, which led NVIDIA to introduce the first GPU. In the 1999- 2000 timeframe, computer scientists and domain scientists from various fields started using GPUs to accelerate a range of scien- tific applications. This was the advent of the movement called GPGPU, or General-Purpose computation on GPU. While users achieved unprecedented performance (over 100x compared to CPUs in some cases), the challenge was that GPGPU required the use of graphics programming APIs like OpenGL and Cg to program the GPU. This limited accessibility to the tremen- dous capability of GPUs for science. NVIDIA recognized the potential of bringing this performance for the larger scientific community, invested in making the GPU fully programmable, and offered seamless experience for devel- opers with familiar languages like C, C++, and Fortran. GPU computing momentum is growing faster than ever before. Today, some of the fastest supercomputers in the world rely on GPUs to advance scientific discoveries; 600 universities around the world teach parallel computing with NVIDIA GPUs; and hun- dreds of thousands of developers are actively using GPUs. All NVIDIA GPUs—GeForce, Quadro, and Tesla — support GPU computing and the CUDA parallel programming model. Develop- ers have access to NVIDIA GPUs in virtually any platform of their choice, including the latest Apple MacBook Pro. However, we recommend Tesla GPUs for workloads where data reliability and overall performance are critical. Tesla GPUs are designed from the ground-up to accelerate sci- entific and technical computing workloads. Based on innovative features in the “Kepler architecture,” the latest Tesla GPUs offer 3x more performance compared to the previous architecture, more than one teraflops of double-precision floating point while dramatically advancing programmability and efficiency. Kepler is the world’s fastest and most efficient high performance com- puting (HPC) architecture. Kepler GK110 – The Next Generation GPU Computing Architecture As the demand for high performance parallel computing increas- es across many areas of science, medicine, engineering, and fi- nance, NVIDIA continues to innovate and meet that demand with extraordinarily powerful GPU computing architectures. NVIDIA’s existing Fermi GPUs have already redefined and accel- GPU Computing Applications NVIDIA GPU with the CUDA Parallel Computing Architecture C C++ OpenCL FORTRANDirectX Compute Java and Python The CUDA parallel architecure the cuda programming model 1 2 3 4 CUDA Parallel Compute Engines inside NVIDIA GPUs CUDA Support in OS Kernel CUDA Driver Applications Using DirectX DirectX Compute OpenCL Driver C Runtime for CUDA Applications Using OpenCL Applications Using the CUDA Driver API Applications Using C, C++, Fortran, Java, Python HLSL Compute Shaders OpenCL C Compute Kernels C for CUDA Compute Kernels PTX (ISA) C for CUDA Compute Functions Device-level APIs Language Integration
    • 98 erated High Performance Computing (HPC) capabilities in areas such as seismic processing, biochemistry simulations, weather and climate modeling, signal processing, computational finance, computer aided engineering, computational fluid dynamics, and data analysis. NVIDIA’s new Kepler GK110 GPU raises the parallel computing bar considerably and will help solve the world’s most difficult computing problems. By offering much higher processing power than the prior GPU generation and by providing new methods to optimize and in- crease parallel workload execution on the GPU, Kepler GK110 simplifies creation of parallel programs and will further revolu- tionize high performance computing. Kepler GK110 – Extreme Performance, Extreme Efficiency Comprising 7.1 billion transistors, Kepler GK110 is not only the fastest, but also the most architecturally complex microproces- sor ever built. Adding many new innovative features focused on compute performance, GK110 was designed to be a parallel pro- cessing powerhouse for Tesla® and the HPC market. Kepler GK110 provides over 1 TFlop of double precision throughput with greater than 93% DGEMM efficiency versus 60-65% on the prior Fermi architecture. In addition to greatly improved performance, the Kepler archi- tecture offers a huge leap forward in power efficiency, delivering up to 3x the performance per watt of Fermi. The following new features in Kepler GK110 enable increased GPU utilization, simplify parallel program design, and aid in the deployment of GPUs across the spectrum of compute environ- ments ranging from personal workstations to supercomputers: ʎʎ Dynamic Parallelism – adds the capability for the GPU to ge- nerate new work for itself, synchronize on results, and con- trol the scheduling of that work via dedicated, accelerated hardware paths, all without involving the CPU. By providing the flexibility to adapt to the amount and form of parallelism NVIDIAGPUComputing Kepler GK110 – The Next Generation GPU Computing Architecture
    • 99 through the course of a program’s execution, programmers can expose more varied kinds of parallel work and make the most efficient use the GPU as a computation evolves. This ca- pability allows less - structured, more complex tasks to run ea- sily and effectively, enabling larger portions of an application to run entirely on the GPU. In addition, programs are easier to create, and the CPU is freed for other tasks. ʎʎ Hyper-Q – Hyper-Q enables multiple CPU cores to launch work on a single GPU simultaneously, thereby dramatically increa- sing GPU utilization and significantly reducing CPU idle times. Hyper-Q increases the total number of connections (work queues) between the host and the GK110 GPU by allowing 32 simultaneous, hardware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexi- ble solution that allows separate connections from multiple CUDA streams, from multiple Message Passing Interface (MPI) processes, or even from multiple threads within a process. Applications that previously encountered false serialization across tasks, thereby limiting achieved GPU utilization, can see up to dramatic performance increase without changing any existing code. ʎʎ Grid Management Unit – Enabling Dynamic Parallelism re- quires an advanced, flexible grid management and dispatch control system. The new GK110 Grid Management Unit (GMU) manages and prioritizes grids to be executed on the GPU. The GMU can pause the dispatch of new grids and queue pending andsuspendedgridsuntiltheyarereadytoexecute,providing the flexibility to enable powerful runtimes, such as Dynamic Parallelism. The GMU ensures both CPU- and GPU-generated workloads are properly managed and dispatched. ʎʎ NVIDIA GPUDirect – NVIDIA GPUDirect is a capability that enables GPUs within a single computer, or GPUs in different servers located across a network, to directly exchange data without needing to go to CPU/system memory. The RDMA feature in GPUDirect allows third party devices such as SSDs, NICs, and IB adapters to directly access memory on multiple GPUs within the same system, significantly decreasing the latency of MPI send and receive messages to/from GPU me- mory. It also reduces demands on system memory bandwidth and frees the GPU DMA engines for use by other CUDA tasks. Kepler GK110 also supports other GPUDirect features inclu- ding Peer-to-Peer and GPUDirect for Video. An Overview of the GK110 Kepler Architecture Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest performing parallel computing micropro- cessor in the world. GK110 not only greatly exceeds the raw com- pute horsepower delivered by Fermi, but it does so efficiently, consuming significantly less power and generating much less heat output.
    • 100 Introducing NVIDIA Parallel Nsight NVIDIA Parallel Nsight is the first development environment de- signed specifically to support massively parallel CUDA C, OpenCL, and DirectCompute applications. It bridges the productivity gap between CPU and GPU code by bringing parallel-aware hardware source code debugging and performance analysis directly into Mi- crosoft Visual Studio, the most widely used integrated application development environment under Microsoft Windows. Parallel Nsight allows Visual Studio developers to write and de- bug GPU source code using exactly the same tools and interfaces that are used when writing and debugging CPU code, including source and data breakpoints, and memory inspection. Further- more, Parallel Nsight extends Visual Studio functionality by of- fering tools to manage massive parallelism, such as the ability to focus and debug on a single thread out of the thousands of threads running parallel, and the ability to simply and efficiently visualize the results computed by all parallel threads. Parallel Nsight is the perfect environment to develop co-pro- cessing applications that take advantage of both the CPU and GPU. It captures performance events and information across both processors, and presents the information to the devel- oper on a single correlated timeline. This allows developers to see how their application behaves and performs on the entire system, rather than through a narrow view that is focused on a particular subsystem or processor. NVIDIAGPUComputing Introducing NVIDIA Parallel Nsight
    • 101 Parallel Nsight Debugger for GPU Computing ʎʎ Debug your CUDA C/C++ and DirectCompute source code di- rectly on the GPU hardware ʎʎ As the industry’s only GPU hardware debugging solution, it drastically increases debugging speed and accuracy ʎʎ Use the familiar Visual Studio Locals, Watches, Memory and Breakpoints windows Parallel Nsight Analysis Tool for GPU Computing ʎʎ Isolate performance bottlenecks by viewing system-wide CPU+GPU events ʎʎ Support for all major GPU Computing APIs, including CUDA C/ C++, OpenCL, and Microsoft DirectCompute Debugger for GPU Computing Analysis Tool for GPU Computing
    • transtec has strived for developing well-engineered GPU Com- puting solutions from the very beginning of the Tesla era. From High-Performance GPU Workstations to rack-mounted Tesla server solutions, transtec has a broad range of specially designed systems available. As an NVIDIA Tesla Preferred Provider (TPP), transtec is able to provide customers with the latest NVIDIA GPU technology as well as fully-engineered hybrid systems and Tesla Preconfigured Clusters. Thus, customers can be assured that transtec’s large experience in HPC cluster solutions is seamlessly brought into the GPU computing world. Performance Engineer- ing made by transtec. 102 NVIDIAGPUComputing Introducing NVIDIA Parallel Nsight
    • 103 Parallel Nsight Debugger for Graphics Development ʎʎ Debug HLSL shaders directly on the GPU hardware. Drastical- ly increasing debugging speed and accuracy over emulated (SW) debugging ʎʎ Use the familiar Visual Studio Locals, Watches, Memory and Breakpoints windows with HLSL shaders, including Direct- Compute code ʎʎ The Debugger supports all HLSL shader types: Vertex, Pixel, Geometry, and Tessellation Parallel Nsight Graphics Inspector for Graphics Development ʎʎ Graphics Inspector captures Direct3D rendered frames for real-time examination ʎʎ The Frame Profiler automatically identifies bottlenecks and performance information on a per-draw call basis ʎʎ Pixel History shows you all operations that affected a given pixel Debugger for Graphics Development Graphics Inspector for Graphics Development
    • 104 A full Kepler GK110 implementation includes 15 SMX units and six 64-bit memory controllers. Different products will use differ- ent configurations of GK110. For example, some products may deploy 13 or 14 SMXs. Key features of the architecture that will be discussed below in more depth include: ʎʎ The new SMX processor architecture ʎʎ An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O im- plementation. ʎʎ Hardware support throughout the design to enable new pro- gramming model capabilities Kepler GK110 supports the new CUDA Compute Capability 3.5. (For a brief overview of CUDA see Appendix A - Quick Refresher on CUDA). The following table compares parameters of different Compute Capabilities for Fermi and Kepler GPU architectures: Performance per Watt A principal design goal for the Kepler architecture was impro- ving power efficiency. When designing Kepler, NVIDIA engineers NVIDIAGPUComputing Kepler GK110 – The Next Generation GPU Computing Architecture Kepler GK110 Full chip block diagram
    • 105 applied everything learned from Fermi to better optimize the Kepler architecture for highly efficient operation. TSMC’s 28nm manufacturing process plays an important role in lowering pow- er consumption, but many GPU architecture modifications were required to further reduce power consumption while maintai- ning great performance. Fermi GF100 Fermi GF104 Kepler GK104 Kepler GK110 Compute Capability 2.0 2.1 3.0 3.5 Threads / Warp 32 32 32 32 Max Warps / Multiprocessor 48 48 64 64 Max Threads / Multiproces- sor 1536 1536 2048 2048 Max Thread Blocks / Multi- processor 8 8 16 16 32-bit Registers / Multipro- cessor 32768 32768 65536 65536 Max Registers / Thread 63 63 63 255 Max Threads / Thread Block 1024 1024 1024 1024 Shared Memory Size Con- figurations (bytes) 16K 48K 16K 48K 16K 32K 48K 16K 32K 48K Max X Grid Dimension 2^16-1 2^16-1 2^32-1 2^32-1 Hyper-Q no no no yes Dynamic Parallelism no no no yes Every hardware unit in Kepler was designed and scrubbed to provide outstanding performance per watt. The best example of great perf/watt is seen in the design of Kepler GK110’s new Streaming Multiprocessor (SMX), which is similar in many res- pects to the SMX unit recently introduced in Kepler GK104, but includes substantially more double precision units for compute algorithms. Streaming Multiprocessor (SMX) Architecture Kepler GK110’s new SMX introduces several architectural inno- vations that make it not only the most powerful multiprocessor we’ve built, but also the most programmable and power-effi- cient. SMX Processing Core Architecture Each of the Kepler GK110 SMX units feature 192 single-precision CUDA cores, and each core has fully pipelined floating-point and integer arithmetic logic units. Kepler retains the full IEEE 754-2008 compliant single- and double-precision arithmetic introduced in Fermi, including the fused multiply-add (FMA) operation. One of the design goals for the Kepler GK110 SMX was to signifi- cantlyincreasetheGPU’sdelivereddoubleprecisionperformance, since double precision arithmetic is at the heart of many HPC ap- plications. Kepler GK110’s SMX also retains the special function units (SFUs) for fast approximate transcendental operations as in previous-generation GPUs, providing 8x the number of SFUs of the Fermi GF110 SM. Similar to GK104 SMX units, the cores within the new GK110 SMX units use the primary GPU clock rather than the 2x shader clock. Recall the 2x shader clock was introduced in the G80 Tes- la-architecture GPU and used in all subsequent Tesla- and Fermi- architecture GPUs. Running execution units at a higher clock rate SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFU), and 32 load/store units (LD/ST).
    • 106 allows a chip to achieve a given target throughput with fewer copies of the execution units, which is essentially an area optimi- zation, but the clocking logic for the faster cores is more power- hungry. For Kepler, our priority was performance per watt. While we made many optimizations that benefitted both area and pow- er, we chose to optimize for power even at the expense of some addedareacost,withalargernumberofprocessingcoresrunning at the lower, less power-hungry GPU clock. Quad Warp Scheduler The SMX schedules threads in groups of 32 parallel threads called warps. Each SMX features four warp schedulers and eight instruc- tion dispatch units, allowing four warps to be issued and execut- ed concurrently. Kepler’s quad warp scheduler selects four warps, and two independent instructions per warp can be dispatched each cycle. Unlike Fermi, which did not permit double precision instructions to be paired with other instructions, Kepler GK110 allows double precision instructions to be paired with other ins- tructions. NVIDIA, GeForce, Tesla, CUDA, PhysX, GigaThread, NVIDIA Parallel Data Cache and certain other trademarks and logos appearing in this brochure, are trademarks or registered trademarks of NVIDIA Corporation. NVIDIAGPUComputing Kepler GK110 – The Next Generation GPU Computing Architecture Each Kepler SMX contains 4 Warp Schedulers, each with dual Instruction Dis- patch Units. A single Warp Scheduler Unit is shown above.
    • 107 We also looked for opportunities to optimize the power in the SMX warp scheduler logic. For example, both Kepler and Fermi schedulers contain similar hardware units to handle the schedu- ling function, including: ʎʎ Register scoreboarding for long latency operations (texture and load) ʎʎ Inter-warp scheduling decisions (e.g., pick the best warp to go next among eligible candidates) ʎʎ Thread block level scheduling (e.g., the GigaThread engine) However, Fermi’s scheduler also contains a complex hardware stage to prevent data hazards in the math datapath itself. A mul- ti-port register scoreboard keeps track of any registers that are not yet ready with valid data, and a dependency checker block analyzes register usage across a multitude of fully decoded warp instructions against the scoreboard, to determine which are eli- gible to issue. For Kepler, we recognized that this information is deterministic (the math pipeline latencies are not variable), and therefore it is possible for the compiler to determine up front when instructions will be ready to issue, and provide this information in the instruc- tion itself. This allowed us to replace several complex and power- expensive blocks with a simple hardware block that extracts the pre-determined latency information and uses it to mask out warps from eligibility at the inter-warp scheduler stage. New ISA Encoding: 255 Registers per Thread The number of registers that can be accessed by a thread has been quadrupled in GK1 10, allowing each thread access to up to 255 registers. Codes that exhibit high register pressure or spill- ing behavior in Fermi may see substantial speedups as a result of the increased available per-thread register count . A compel- ling example can be seen in the QUDA library for performing lat- tice QCD (quantum chromodynamics) calculations using CUDA. QUDA fp64-based algorithms see performance increases up to 5.3x due to the ability to use many more registers per thread and experiencing fewer spills to local memory. Shuffle Instruction To further improve performance, Kepler implements a new Shuffle instruction, which allows threads within a warp to share data. Previously, sharing data between threads within a warp required separate store and load operations to pas the data through shared memory. With the Shuffle instruction, threads within a warp can read values from other threads in the warp in just about any imaginable permutation. Shuffle supports arbi- trary indexed references –– i.e. any thread reads from any other thread. U seful shuffle subsets including next-thread (offset u p or down by a fixed amount) and XORR “butterfly” style permuta- tions among the threads in a warp, are also available as CUDA intrinsics. This example shows some of the variations possible using the new Shuffle instruction in Kepler.
    • 108 Shuffle offers a performance advantage over shared memory, in that a store-and-load operation is carried out in a single step. Shuffle also can reduce the amount of shared memory needed per thread block, since data exchanged at the warp level never needs to o be placed in shared memory. In the case of FFT, which requires da ta sharing within a warp, a 6% performance gain can be seen just by using Shuffle. Atomic Operations Atomic memory operations are important in parallel program- ming, allowing concurrent threads to correctly perform read- modify-write operations on shared data structures. Atomic oper- ations such as add, min, max, and compare-and-swap are atomic in the sense that the read, modify, and write operations are per- formed without interruption by other threads. Atomic memory operations are widely used for parallel sorting, reduction opera- tions, and building data structures in parallel without locks that Serialize thread execution. Throughput of global memory atomic operations on Kepler GK110 is substantially improved compared to the Fermi genera- tion. Atomic operation throughput to a common global memory address is improved by 9x to one operation per clock. Atomic operation throughput to independent global addresses is also significantly accelerated, and logic to handle address conflicts has been made more efficient. Atomic operations can often be processed at rates similar to global load operations. This speed increase makes atomics fast enough to use frequently within kernel inner loops, eliminating the separate reduction passes that were previously required by some algorithms to consoli- date results. Kepler GK110 also expands the native support for 64-bit atomic operations in global memory. In addition to atomi- cAdd, atomicCAS, and atomicExch (which were also supported by Fermi and Kepler GK104), GK110 supports the following: ʎʎ atomicMin ʎʎ atomicMax NVIDIAGPUComputing Kepler GK110 – The Next Generation GPU Computing Architecture
    • 109 ʎʎ atomicAnd ʎʎ atomicOr ʎʎ atomicXor Other atomic operations which are not supported natively (for example 64-bit floating point atomics) may be emulated using the compare-and-swap (CAS) instruction. Texture Improvements The GPU’s dedicated hardware Texture units are a valuable re- source for compute programs with a need to sample or filter image data. The texture throughput in Kepler is significantly in- creased compared to Fermi – each SMX unit contains 16 texture filtering units, a 4x increase vs the Fermi GF110 SM. In addition, Kepler changes the way texture state is managed. In the Fermi generation, for the GPU to reference a texture, it had to be assigned a “slot” in a fixed-size binding table prior to grid launch. The number of slots in that table ultimately limits how many unique textures a program can read from at runtime. Ulti- mately, a program was limited to accessing only 128 simultane- ous textures in Fermi. With bindless textures in Kepler, the additional step of using slots isn’t necessary: texture state is now saved as an object in memory and the hardware fetches these state objects on de- mand, making binding tables obsolete. This effectively elimi- nates any limits on the number of unique textures that can be referenced by a compute program. Instead, programs can map textures at any time and pass texture handles around as they would any other pointer. Kepler Memory Subsystem – L1, L2, ECC Kepler’s memory hierarchy is organized similarly to Fermi. The Kepler architecture supports a unified memory request path for loads and stores, with an L1 cache per SMX multiprocessor. Ke- pler GK110 also enables compiler-directed use of an additional new cache for read-only data, as described below. 64 KB Configurable Shared Memory and L1 Cache In the Kepler GK110 architecture, as in the previous generation Fermi architecture, each SMX has 64 KB of on-chip memory that can be configured as 48 KB of Shared memory with 16 KB of L1 cache, or as 16 KB of shared memory with 48 KB of L1 cache. Ke- pler now allows for additional flexibility in configuring the al- location of shared memory and L1 cache by permitting a 32KB / 32KB split between shared memory and L1 cache. To support the increased throughput of each SMX unit, the shared memory bandwidth for 64b and larger load operations is also doubled compared to the Fermi SM, to 256B per core clock. 48KB Read-Only Data Cache In addition to the L1 cache, Kepler introduces a 48KB cache for data that is known to be read-only for the duration of the func- tion. In the Fermi generation, this cache was accessible only by the Texture unit. Expert programmers often found it advanta- geous to load data through this path explicitly by mapping their
    • 110 data as textures, but this approach had many limitations. In Kepler, in addition to significantly increasing the capacity of this cache along with the texture horsepower increase, we de- cided to make the cache directly accessible to the SM for general load operations. Use of the read-only path is beneficial because it takes both load and working set footprint off of the Shared/L1 cache path. In addition, the Read-Only Data Cache’s higher tag bandwidth supports full speed unaligned memory access pat- terns among other scenarios. Use of the read-only path can be managed automatically by the compiler or explicitly by the programmer. Access to any variable or data structure that is known to be constant through program- mer use of the C99-standard “const __restrict” keyword may be tagged by the compiler to be loaded through the Read- Only Data Cache. The programmer can also explicitly use this path with the __ldg() intrinsic. Improved L2 Cache The Kepler GK110 GPU features 1536KB of dedicated L2 cache memory, double the amount of L2 available in the Fermi archi- tecture. The L2 cache is the primary point of data unification between the SMX units, servicing all load, store, and texture re- quests and providing efficient, high speed data sharing across the GPU. The L2 cache on Kepler offers up to 2x of the bandwidth per clock available in Fermi. Algorithms for which data addresses are not known beforehand, such as physics solvers, ray tracing, and sparse matrix multiplication especially benefit from the cache hierarchy. Filter and convolution kernels that require mul- tiple SMs to read the same data also benefit. Memory Protection Support Like Fermi, Kepler’s register files, shared memories, L1 cache, L2 cache and DRAM memory are protected by a Single-Error Correct Double-Error Detect (SECDED) ECC code. In addition, the Read-Only DataCachesupportssingle-errorcorrectionthroughaparitycheck; in the event of a parity error, the cache unit automatically invali- NVIDIAGPUComputing Kepler GK110 – The Next Generation GPU Computing Architecture
    • 111 dates the failed line, forcing a read of the correct data from L2. ECC checkbit fetches from DRAM necessarily consume some amount of DRAM bandwidth, which results in a performance difference between ECC-enabled and ECC-disabled operation, especially on memory bandwidth-sensitive applications. Ke- pler GK110 implements several optimizations to ECC checkbit fetch handling based on Fermi experience. As a result, the ECC on-vs-off performance delta has been reduced by an average of 66%, as measured across our internal compute application test suite. Dynamic Parallelism In a hybrid CPU-GPU system, enabling a larger amount of paral- lel code in an application to run efficiently and entirely within the GPU improves scalability and performance as GPUs increase in perf/watt. To accelerate these additional parallel portions of the application, GPUs must support more varied types of parallel workloads. DynamicParallelismisanewfeatureintroducedwithKeplerGK110 that allows the GPU to generate new work for itself, synchronize on results, and control the scheduling of that work via dedicated, ac- celerated hardware paths, all without involving the CPU. Fermi was very good at processing large parallel data structures when the scale and parameters of the problem were known at kernel launch time. All work was launched from the host CPU, would run to completion, and return a result back to the CPU. The result would then be used as part of the final solution, or would be analyzed by the CPU which would then send additional requests back to the GPU for additional processing. In Kepler GK110 any kernel can launch another kernel, and can create the necessary streams, events and manage the depen- dencies needed to process additional work without the need for host CPU interaction. This architectural innovation makes it easier for developers to create and optimize recursive and data- dependent execution patterns, and allows more of a program to be run directly on GPU. The system CPU can then be freed up for additional tasks, or the system could be configured with a less powerful CPU to carry out the same workload. Dynamic Parallelism allows more varieties of parallel algorithms to be implemented on the GPU, including nested loops with dif- fering amounts of parallelism, parallel teams of serial control- task threads, or simple serial control code offloaded to the GPU in order to promote data-locality with the parallel portion of the application. Because a kernel has the ability to launch additional workloads based on intermediate, on-GPU results, programmers can now intelligently load-balance work to focus the bulk of their re- sources on the areas of the problem that either require the most processing power or are most relevant to the solution. One example would be dynamically setting up a grid for a nu- merical simulation – typically grid cells are focused in regions of greatest change, requiring an expensive pre-processing pass through the data. Alternatively, a uniformly coarse grid could be used to prevent wasted GPU resources, or a uniformlyfine grid could be used to ensure all the features are captured, but these options risk missing simulation features or “over-spending” compute resources on regions of less interest. With Dynamic Parallelism, the grid resolution can be determined dynamically at Dynamic Parallelism allows more parallel code in an application to be launched directly by the GPU onto itself (right side of image) rather than requiring CPU intervention (left side of image).
    • 112 NVIDIAGPUComputing A Quick Refresher on CUDA
    • 113 A Quick Refresher on CUDA CUDA is a combination hardware/software platform that en- ables NVIDIA GPUs to execute programs written with C, C++, Fortran, and other languages. A CUDA program invokes paral- lel functions called kernels that execute accross many parallel threads. The programmer or compiler organizes these threads into thread blocks and grids of thre ad blocks, as shown in the Figure on the right side. Each thread within a thread block ex- ecutes an instance oof the kernel. Each thread also has thread and block IDs within its thread block and grid, a program coun- ter, registers, per-thread private memory, inputs, and output re- suults. A thread block is a set of concurren tly executing threads that can cooperate among themselves through barrier sy nchronizationn and shared memory. A t hread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write results to global memory, and synchronize between de- pendent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for register spills, function calls, and C automaticc array variables. Each thread block has a per-block shared memory space used for inter-thread communication, datasharing, and result sharing in parallel allgo- rithms. Gr rids of thread blocks share results in Global Memory space after kernel-wide global synchronization. CUDA Hardware Execution CUDA’s hierarchy of threads maps to a hierarchy of processors on the GPU; a GPU executes one or more kernel grids; a stream- ing multiprocessor (SM on Fermi / SMX on Kepler) executes one or more thread blocks; and CUDA cores and other execution units in the SMX execute thread instructions. The SMX executes threads in groups of 32 threads called warps. While programmers can generally ignore warp execution for functional correctness and focus on programming individual scalar threads, they can greatly improve performance by having threads in a warp execute the same code path and access memory with nearby addresses. CUDA Hierarchy of threads, blocks, and grids, with corresponding per-thread private, per-block shared,-and per-application global memory spaces.
    • 114 runtime in a datadependent manner. Starting with a coarse grid, the simulation can “zoom in” on ar- eas of interest while avoiding unnecessary calculation in areas with little change. Though this could be accomplished using a se- quence of CPU-launched kernels, it would be far simpler to allow the GPU to refine the grid itself by analyzing the data and launch- ing additional work as part of a single simulation kernel, eliminat- ing interruption of the CPU and data transfers between the CPU and GPU. The above example illustrates the benefits of using a dynami- cally sized grid in a numerical simulation. To meet peak preci- sion requirements, a fixed resolution simulation must run at an excessively fine resolution across the entire simulation domain, whereas a multi-resolution grid applies the correct simulation resolution to each area based on local variation. Hyper-Q One of the challenges in the past has been keeping the GPU sup- plied with an optimally scheduled load of work from multiple streams. Thee Fermi architecture supported 16-way concur- rency of kernel launches from separate streams, but ultimately the streams were all multiplexed into the same hardware work queue . This allowed for false intra-stream dependencies, requir- NVIDIAGPUComputing Kepler GK110 – The Next Generation GPU Computing Architecture Image attribution Charles Reid
    • 115 ing dependent kernels within one stream to complete before ad- ditional kernels in a separate stream could be executed. While this could be alleviated to some extent through the use of a breadth- first launch order , as program complexity increases, this can be- come more and more difficult to manage efficiently. Kepler GKK110 improves on this functionality with the new Hy- per-Q feature. Hyper-Q increases the total number of connection s (work queues) between the host and the CUDA Work Distribu- tor (CWD) logic in the GPU by allowing 322 simultaneous, hard- ware-managed connections (compared to the single connection available with Fermi). Hyper-Q is a flexible solution that allows connections from multiple CUDA streams, from multiple Mes- sage Passing Interface (MP PI) processes, or even from multiple threads within a process. Applications that previously encoun- tered false serialization across tasks, there by limiting GPU utili- zation, can see up to a 32x performance increase without chang- ing any existing co de. Each CUDA stream is managed within its own hardware work queue, inter-stream dependencies are optimized, and opera- tions in one stream will no longer block other streams, enabling streams to execute concurrently without neding to specifically tailor the launch order to eliminate possible false dependencies. Hyper-Q offers significant benefits for use in MPI-based paral- lel computer systems. Legacy MPI-based algorithms were often created to run on multi-core CPU systems, with the amount of work assigned to each MPI process scaled accordingly. This can lead to a single MPI process having insufficient work to fully oc- cupy the GPU. While it has always been possible for multiple MPI processes to share a GPU, these processes could become bottle- necked by false dependencies. Hyper-Q removes those false de- pendencies, dramatically increasing the efficiency of GPU shar- ing across MPI processes. Grid Management Unit - Efficiently Keeping the GPU Utilized New features in Kepler GK110, such as the ability for CUDA ker- nels to launch work directly on the GPU with Dynamic Parallel- ism, required that the CPU-to-GPU workflow in Kepler offer in- creased functionality over the Fermi design. On Fermi, a grid of thread blocks would be launched by the CPU and would always run to completion, creating a simple unidirectional flow of work Hyper-Q permits more simultaneous connections between CPU and GPU Hyper-Q working with CUDA Streams: In the Fermi model shown on the left, only (C,P) & (R,X) can run concurrently due to intra-stream dependencies caused by the single hardware work queue. The Kepler Hyper-Q model allows all streams to run concurrently using separate work queues.
    • 116 from the host to the SMs via the CUDA Work Distributor (CWD) unit. Kepler GK110 was designed to improve the CPU-to-GPU workflow by allowing the GPU to efficiently manage both CPU- and CUDA-created workloads. We discussed the ability of the Kepler GK110 GPU to allow ker- nels to launch work directly on the GPU, and it’s important to understand the changes made in the Kepler GK110 architec- ture to facilitate these new functions. In Kepler, a grid can be launched from the CPU just as was the case with Fermi, how- ever new grids can also be created programmatically by CUDA within the Kepler SMX unit. To manage both CUDA-created and host-originated grids, a new Grid Management Unit (GMU) was introduced in Kepler GK110. This control unit manages and pri- oritizes grids that are passed into the CWD to be sent to the SMX units for execution. The CWD in Kepler holds grids that are ready to dispatch, and it is able to dispatch 32 active grids, which is double the capacity of NVIDIAGPUComputing Kepler GK110 – The Next Generation GPU Computing Architecture The redesigned Kepler HOST to GPU workflow shows the new Grid Manage- ment Unit, which allows it to manage the actively dispatching grids, pause dispatch, and hold pending and suspended grids.
    • 117 the Fermi CWD. The Kepler CWD communicates with the GMU via a bidirectional link that allows the GMU to pause the dispatch of new grids and to hold pending and suspended grids until needed. The GMU also has a direct connection to the Kepler SMX units to permit grids that launch additional work on the GPU via Dynamic Parallelism to send the new work back to GMU to be prioritized and dispatched. If the kernel that dispatched the ad- ditional workload pauses, the GMU will hold it inactive until the dependent work has completed. NVIDIA GPUDirect When working with a large amount of data, increasing the data throughput and reducing latency is vital to increasing compute performance. Kepler GK110 supports the RDMA feature in NVIDIA GPUDirect, which is designed to improve performance by allow- ing direct access to GPU memory by third-party devices such as IB adapters, NICs, and SSDs. When using CUDA 5.0, GPUDirect pro- vides the following important features: ʎʎ Direct memory access (DMA) between NIC and GPU without the need for CPU-side data buffering. ʎʎ Significantly improved MPISend/MPIRecv efficiency between GPU and other nodes in a network. ʎʎ Eliminates CPU bandwidth and latency bottlenecks ʎʎ Works with variety of 3rd-party network, capture, and storage devices Applications like reverse time migration (used in seismic imag- ing for oil & gas exploration) distribute the large imaging data across several GPUs. Hundreds of GPUs must collaborate to crunch the data, often communicating intermediate results. GPUDirect enables much higher aggregate bandwidth for this GPUto GPU communication scenario within a server and across servers with the P2P and RDMA features. Kepler GK110 also supports other GPUDirect features such as Peer- to-Peer and GPUDirect for Video. Conclusion With the launch of Fermi in 2010, NVIDIA ushered in a new era in the high performance computing (HPC) industry based on a hybrid computing model where CPUs and GPUs work together to solve computationally-intensive workloads. Now, with the new Kepler GK110 GPU, NVIDIA again raises the bar for the HPC industry. Kepler GK110 was designed from the ground up to maximize computational performance and throughput computing with outstanding power efficiency. The architecture has many new innovations such as SMX, Dynamic Parallelism, and Hyper-Q that make hybrid computing dramatically faster, easier to program, and applicable to a broader set of applications. Kepler GK110 GPUs will be used in numerous systems ranging from worksta- tions to supercomputers to address the most daunting chal- lenges in HPC. GPUDirect RDMA allows direct access to GPU memory from 3rd-party devices such as network adapters, which translates into direct transfers between GPUs across nodes as well.
    • 118 Executive Overview The High Performance Computing market’s continuing need for improved time-to-solution and the ability to explore expanding models seems unquenchable, requiring ever-faster HPC clus- ters. This has led many HPC users to implement graphic pro- cessing units (GPUs) into their clusters. While GPUs have tradi- tionally been used solely for visualization or animation, today they serve as fully programmable, massively parallel proces- sors, allowing computing tasks to be divided and concurrently processed on the GPU’s many processing cores. When multiple GPUs are integrated into an HPC cluster, the performance po- tential of the HPC cluster is greatly enhanced. This processing environment enables scientists and researchers to tackle some of the world’s most challenging computational problems. HPC applications modified to take advantage of the GPU pro- cessing capabilities can benefit from significant performance gains over clusters implemented with traditional processors. To obtain these results, HPC clusters with multiple GPUs require a high-performance interconnect to handle the GPU-to-GPU com- munications and optimize the overall performance potential of the GPUs. Because the GPUs place significant demands on the interconnect, it takes a high-performance interconnect, such as InfiniBand, to provide the low latency, high message rate, and bandwidth that are needed to enable all resources in the cluster to run at peak performance. Intel worked in concert with NVIDIA to optimize Intel TrueScale InfiniBand with NVIDIA GPU technologies. This solution sup- ports the full performance potential of NVIDIA GPUs through an interface that is easy to deploy and maintain. Key Points ʎʎ Up to 44 percent GPU performance improvement versus implementation without GPUDirect – a GPU computing NVIDIAGPUComputing Intel TrueScale InfiniBand and GPUs Intel is a global leader and technology innovator in high perfor- mance networking, including adapters, switches and ASICs.
    • 119 product from NVIDIA that enables faster communication be- tween the GPU and InfiniBand ʎʎ Intel TrueScale InfiniBand offers as much as 10 percent better GPU performance than other InfiniBand interconnects ʎʎ Ease of installation and maintenance – Intel’s implementa- tion offers a streamlined deployment approach that is sig- nificantly easier than alternatives Ease of Deployment One of the key challenges with deploying clusters consisting of multi-GPU nodes is to maximize application performance. With- out GPUDirect, GPU-to-GPU communications would require the host CPU to make multiple memory copies to avoid a memory pinning conflict between the GPU and InfiniBand. Each addi- tional CPU memory copy significantly reduces the performance potential of the GPUs. Intel’s implementation of GPUDirect takes a streamlined ap- proach to optimizing NVIDIA GPU performance with Intel TrueS- cale InfiniBand. With Intel’s solution, a user only needs to update the NVIDIA driver with code provided and tested by Intel. Other InfiniBand implementations require the user to implement a Linux kernel patch as well as a special InfiniBand driver. The Intel approach provides a much easier way to deploy, support, and maintain GPUs in a cluster without having to sacrifice per- formance. In addition, it is completely compatible with other GPUDirect implementations; the CUDA libraries and application code require no changes. Optimized Performance Intel used AMBER molecular dynamics simulation software to test clustered GPU performance with and without GPUDirect. Figure 15 shows that there is a significant performance gain of up to 44 percent that results from streamlining the host memory access to support GPU-to-GPU communications. Clustered GPU Performance HPC applications that have been designed to take advantage of parallel GPU performance require a high-performance inter- connect, such as InfiniBand, to maximize that performance. In addition, the implementation or architecture of the InfiniBand interconnect can impact performance. The two industry-leading InfiniBand implementations have very different architectures and only one was specifically designed for the HPC market – Intel’s TrueScale InfiniBand. TrueScale InfiniBand provides un- matched performance benefits, especially as the GPU cluster is scaled. It offers high performance in all of the key areas that influ- ence the performance of HPC applications, including GPU-based applications. These factors include the following: ʎʎ Scalable non-coalesced message rate performance greater than 25M messages per second ʎʎ Extremely low latency for MPI collectives, even on clusters consisting of thousands of nodes ʎʎ Consistently low latency of one to two μS, even at scale These factors and the design of Intel TrueScale InfiniBand en- able it to optimize the performance of NVIDIA GPUs. The follow- Performance with and without GPUDirect Amber Performance Cellulose Test 8 GPU Without GPUDirect TrueScale with GPUDirect 4,5 4 3,5 3 2,5 2 1,5 1 0,5 0 44% Better NS/DAY
    • 120 ing tests were performed on NVIDIA Tesla 2050s interconnected with Intel TrueScale QDR InfiniBand at Intel’s NETtrack Develop- er Center. The Tesla 2050 results for the industry’s other leading InfiniBand are from the published results on the AMBER bench- mark site. Figure 16 shows performance results from the AMBER Myoglo- bin benchmark (2,492 atoms) when scaling from two to eight Tesla 2050 GPUs. The results indicate that the Intel TrueScale InfiniBand offers up to 10 percent more performance than the industry’s other leading InfiniBand when both used their ver- sions of GPUDirect. As the figure shows, the performance dif- ference increases as the application is scaled to more GPUs. The next test (Figure 17) shows the impact of the InfiniBand interconnect on the performance of AMBER across models of various sizes. The following Explicit Solvent models were tested: ʎʎ DHFR: 23,558 atoms ʎʎ FactorIX: 90,906 atoms ʎʎ Cellulose: 408,609 atoms It is important to point out that the performance of the models is dependent on the model size, the size of the GPU cluster, and the performance of the InfiniBand interconnect. The smaller the model, the more it is dependent on the interconnect due to the fact that the model’s components (atoms in the case of AMBER) are divided across the available GPUs in the cluster to be processed for each step of the simulation. For example, the DHFR test with its 23,557 atoms means that each Tesla 2050 in an eight-GPU cluster is processing only 2,945 atoms for each step of the simulation. The processing time is relatively small when compared to the communication time. In contrast, the Cellulose model with its 408K atoms requires each GPU to process 17 times more data per step than the DHFR test, so significantly more time is spent in GPU processing than in communications. NVIDIAGPUComputing Intel TrueScale InfiniBand and GPUs
    • 121 The preceding tests demonstrate that the TrueScale InfiniBand performs better under load. The DHFR model is the most sensi- tive to the interconnect performance, and it indicates that Tru- eScale offers six percent more performance than the alternative InfiniBand product. Combining the results from Figure 15 and Figure 16 illustrate that TrueScale InfiniBand provides better results with smaller models on small clusters and better model scalability for larger models on larger GPU clusters. Performance/Watt Advantage Today the focus is not just on performance, but how efficiently that performance can be delivered. This is an area in which Intel TrueScale InfiniBand excels. The National Center for SuperCompute Applications (NCSA) has a cluster based on NVIDIA GPUs interconnected with TrueScale InfiniBand. This cluster is number three on the November 2010 Green500 list with performance of 933 MFlops/Watt . This on its own is a significant accomplishment, but it is even more impressive when considering its original position on the SuperComputing Top500 list. In fact, the cluster is ranked at #404 on the Top500 list, but the combination of NVIDIA’s GPU perfor- mance, Intel’s TrueScale performance, and low power consump- tion enabled the cluster to move up 401 spots from the Top500 list to reach number three on the Green500 list. This is the most dramatic shift of any cluster in the top 50 of the Green500. In part, the following are the reasons for such dramatic perfor- mance/watt results: ʎʎ Performance of the NVIDIA Telsa 2050 GPU ʎʎ Linpack performance efficiency of this cluster is 49 percent, which is almost 20 percent better than most other NVIDIA GPUbased clusters on the Top 500 list ʎʎ The Intel TrueScale InfiniBand Adapter required 25 - 50 per- cent less power than the alternative InfiniBand product Conclusion The performance of the InfiniBand interconnect has a significant impact on the performance of GPU-based clusters. Intel’s Tru- eScale InfiniBand is designed and architected for the HPC mar- ketplace, and it offers an unmatched performance profile with a GPU-based cluster. Finally, Intel’s solution provides an imple- mentation that is easier to deploy and maintain, and allows for optimal performance in comparison to the industry’s other lead- ing InfiniBand. ExplicitSolventBenchmarkResultsfortheTwoLeadingInfiniBandsGPU Scalable Performance with the Industry’s Leading InfiniBands AMBER Myoglobin Test Other InfiniBand TrueScale 140 9,6% 3,2% Even 120 100 80 60 40 20 0 Better 2 4 8 NS/DAY Explicit Solvent Tests 8 x GPU Other InfiniBand TrueScale 50 6% 5% 1% 45 40 35 30 25 20 15 10 5 0 Better Cellulose FactorIX DHFR NS/DAY
    • IntelXeonPhi Coprocessor– TheArchitecture
    • 123 Intel Many Integrated Core (IntelMIC)architecturecom- bines many Intel CPU cores onto a single chip. Intel MIC architecture is targeted for highly parallel, High Per- formance Computing (HPC) workloads in a variety of fieldssuchascomputational physics, chemistry, biology, and financial services. To- day such workloads are run as task parallel programs on large compute clusters.
    • 124 IntelXeonPhiCoprocessor The Architecture The Intel MIC architecture is aimed at achieving high through- put performance in cluster environments where there are rigid floor planning and power constraints. A key attribute of the mi- croarchitecture is that it is built to provide a general-purpose programming environment similar to the Intel Xeon processor programming environment. The Intel Xeon Phi coprocessors based on the Intel MIC architecture run a full service Linux op- erating system, support x86 memory order model and IEEE 754 floating-point arithmetic, and are capable of running applica- tions written in industry-standard programming languages such as Fortran, C, and C++. The coprocessor is supported by a rich development environment that includes compilers, numerous libraries such as threading libraries and high performance math libraries, performance characterizing and tuning tools, and de- buggers. The Intel Xeon Phi coprocessor is connected to an Intel Xeon processor, also known as the “host”, through a PCI Express (PCIe) bus. Since the Intel Xeon Phi coprocessor runs a Linux operating system, a virtualized TCP/IP stack could be implemented over the PCIe bus, allowing the user to access the coprocessor as a network node. Thus, any user can connect to the coprocessor The first generation Intel Xeon Phi product codenamed “Knights Corner”
    • 125 through a secure shell and directly run individual jobs or submit batch jobs to it. The coprocessor also supports heterogeneous applications wherein a part of the application executes on the host while a part executes on the coprocessor. Multiple Intel Xeon Phi coprocessors can be installed in a single host system. Within a single system, the coprocessors can com- municate with each other through the PCIe peer-to-peer inter- connect without any intervention from the host. Similarly, the coprocessors can also communicate through a network card such as InfiniBand or Ethernet, without any intervention from the host. Intel’s initial development cluster named “Endeavor”, which is composed of 140 nodes of prototype Intel Xeon Phi coproces- sors, was ranked 150 in the TOP500 supercomputers in the world based on its Linpack scores. Based on its power consumption, this cluster was as good if not better than other heterogeneous systems in the TOP500. These results on unoptimized prototype systems demonstrate that high levels of performance efficiency can be achieved on compute-dense workloads without the need for a new program- ming language or APIs. The Intel Xeon Phi coprocessor is primarily composed of process- ing cores, caches, memory controllers, PCIe client logic, and a very high bandwidth, bidirectional ring interconnect. Each core comes complete with a private L2 cache that is kept fully coher- ent by a global-distributed tag directory. The memory controllers and the PCIe client logic provide a direct interface to the GDDR5 memory on the coprocessor and the PCIe bus, respectively. All these components are connected together by the ring intercon- nect. Each core in the Intel Xeon Phi coprocessor is designed to be power efficient while providing a high throughput for highly parallel workloads. A closer look reveals that the core uses a short in-order pipeline and is capable of supporting 4 threads in Linpack performance and power of Intel’s cluster Microarchitecture Intel Xeon Phi Coprocessor Core
    • 126 hardware. It is estimated that the cost to support IA architecture legacy is a mere 2% of the area costs of the core and is even less at a full chip or product level. Thus the cost of bringing the Intel Architecture legacy capability to the market is very marginal. Vector Processing Unit An important component of the Intel Xeon Phi coprocessor’s core is its vector processing unit (VPU). The VPU features a nov- el 512-bit SIMD instruction set, officially known as Intel Initial Many Core Instructions (Intel IMCI). Thus, the VPU can execute 16 single-precision (SP) or 8 double-precision (DP) operations per cy- cle. The VPU also supports Fused Multiply-Add (FMA) instructions and hence can execute 32 SP or 16 DP floating point operations per cycle. It also provides support for integers. Vector units are very power efficient for HPC workloads. A single operation can encode a great deal of work and does not incur en- ergy costs associated with fetching, decoding, and retiring many instructions. However, several improvements were required to support such wide SIMD instructions. For example, a mask regis- ter was added to the VPU to allow per lane predicated execution. This helped in vectorizing short conditional branches, thereby improving the overall software pipelining efficiency. The VPU also supports gather and scatter instructions, which are simply non-unit stride vector memory accesses, directly in hardware. IntelXeonPhiCoprocessor The Architecture Vector Processing Unit
    • 127 Thus for codes with sporadic or irregular access patterns, vector scatter and gather instructions help in keeping the code vector- ized. The VPU also features an Extended Math Unit (EMU) that can ex- ecute transcendental operations such as reciprocal, square root, and log, thereby allowing these operations to be executed in a vector fashion with high bandwidth. The EMU operates by calcu- lating polynomial approximations of these functions. The Interconnect The interconnect is implemented as a bidirectional ring. Each direction is comprised of three independent rings. The first, largest, and most expensive of these is the data block ring. The data block ring is 64 bytes wide to support the high bandwidth requirement due to the large number of cores. The address ring is much smaller and is used to send read/write commands and memory addresses. Finally, the smallest ring and the least expen- sive ring is the acknowledgement ring, which sends flow control and coherence messages. When a core accesses its L2 cache and misses, an address re- quest is sent on the address ring to the tag directories. The memory addresses are uniformly distributed amongst the tag directories on the ring to provide a smooth traffic characteristic on the ring. If the requested data block is found in another core’s L2 cache, a forwarding request is sent to that core’s L2 over the address ring and the request block is subsequently forwarded on the data block ring. If the requested data is not found in any caches, a memory address is sent from the tag directory to the memory controller. The figure below shows the distribution of the memory con- trollers on the bidirectional ring. The memory controllers are symmetrically interleaved around the ring. There is an all-to-all mapping from the tag directories to the memory controllers. The addresses are evenly distributed across the memory controllers, thereby eliminating hotspots and providing a uniform access pattern which is essential for a good bandwidth response. During a memory access, whenever an L2 cache miss occurs on a core, the core generates an address request on the address ring and queries the tag directories. If the data is not found in the tag directories, the core generates another address request and queries the memory for the data. Once the memory controller fetches the data block from memory, it is returned back to the core over the data ring. Thus during this process, one data block, two address requests (and by protocol, two acknowledgement messages) are transmitted over the rings. Since the data block rings are the most expensive and are designed to support the re- The Interconnect Distributed Tag Directories
    • 128 quired data bandwidth, we need to increase the number of less expensive address and acknowledgement rings by a factor of two to match the increased bandwidth requirement caused by the higher number of requests on these rings. IntelXeonPhiCoprocessor The Architecture Interleaved Memory Access Interconnect: 2x AD/AK
    • 129 Multi-Threaded Streams Triad The figure below shows the core scaling results for the multi- threaded streams triad workload. These results were generated on a simulator for a prototype of the Intel Xeon Phi coprocessor with only one address ring and one acknowledgement ring per direction in its interconnect. The results indicate that in this case the address and acknowledgement rings would become perfor- mance bottlenecks and would exhibit poor scalability beyond 32 cores. The production grade Intel Xeon Phi coprocessor uses two ad- dress and two acknowledgement rings per direction and pro- vides a good performance scaling up to 50 cores and beyond. It is evident from the figure that the addition of the rings results in an over 40% aggregate bandwidth improvement. Streaming Stores Streaming stores was another key innovation that was em- ployed to further boost memory bandwidth. The pseudo code for Streams Triads is shown below: Streams Triad for (I=0; I<HUGE; I++) A[I] = k*B[I] + C[I]; The stream triad kernel reads two arrays, B and C, and writes a single array A from memory. Historically, a core reads a cache line before it writes the addressed data. Hence there is an ad- ditional read overhead associated with the write. A streaming store instruction allows the cores to write an entire cache line without reading it first. This reduces the number of bytes trans- ferred per iteration from 256 bytes to 192 bytes. Streaming Stores The figure below shows the core scaling results of stream triads workload with streaming stores. As is evident from the results, streaming stores provide a 30% improvement over previous re- sults. Totally, then, by adding two rings per direction and imple- menting streaming stores we are able to improve bandwidth by more than a factor of 2 for multithreaded streams triad. Without Streaming Stores With Streaming Stores Behavior Read A, B, C write A Read B, C write A Bytes transferred to/from memory per iteration 256 192 Multi-threaded Triad – Saturation for 1 AD/AK Ring Multi-threaded Triad – Benefit of Doubling AD/AK
    • Other Design Features Other micro-architectural optimizations incorporated into the Intel Xeon Phi coprocessor include a 64-entry second-level Trans- lation Lookaside Buffer (TLB), simultaneous data cache loads and stores, and 512KB L2 caches. Lastly, the Intel Xeon Phi coproces- sor implements a 16 stream hardware prefetcher to improve the cache hits and provide higher bandwidth.The figure below shows the net performance improvements for the SPECfp 2006 benchmark suite for a single core, single thread runs. The results indicate an average improvement of over 80% per cycle not in- cluding frequency. 130 IntelXeonPhiCoprocessor The Architecture Multi-threaded Triad – with Streaming Stores Per-Core ST Performance Improvement (per cycle)
    • Caches The Intel MIC architecture invests more heavily in L1 and L2 cach- es compared to GPU architectures. The Intel Xeon Phi coproces- sor implements a leading-edge, very high bandwidth memory subsystem. Each core is equipped with a 32KB L1 instruction cache and 32KB L1 data cache and a 512KB unified L2 cache. These caches are fully coherent and implement the x86 memory order model. The L1 and L2 caches provide an aggregate band- width that is approximately 15 and 7 times, respectively, faster compared to the aggregate memory bandwidth. Hence, effective utilization of the caches is key to achieving peak performance on the Intel Xeon Phi coprocessor. In addition to improving band- width, the caches are also more energy efficient for supplying data to the cores than memory. The figure below shows the en- ergy consumed per byte of data transfered from the memory, and L1 and L2 caches. In the exascale compute era, caches will play a crucial role in achieving real performance under restric- tive power constraints. Stencils Stencils are common in physics simulations and are classic examples of workloads which show a large performance gain through efficient use of caches. Stencils are typically employed in simulation of physical sys- tems to study the behavior of the system over time. When these workloads are not programmed to be cache-blocked, they will be bound by memory bandwidth. Cache blocking promises sub- stantial performance gains given the increased bandwidth and energy efficiency of the caches compared to memory. Cache blocking improves performance by blocking the physical struc- ture or the physical system such that the blocked data fits well into a core’s L1 and or L2 caches. For example, during each time- step, the same core can process the data which is already resi- dent in the L2 cache from the last time step, and hence does not need to be fetched from the memory, thereby improving perfor- mance. Additionally, the cache coherence further aids the stencil operation by automatically fetching the updated data from the nearest neighboring blocks which are resident in the L2 caches of other cores. Thus, stencils clearly demonstrate the benefits of efficient cache utilization and coherence in HPC workloads. Power Management Intel Xeon Phi coprocessors are not suitable for all workloads. 131 Stencils Example Caches – For or Against?
    • In some cases, it is beneficial to run the workloads only on the host. In such situations where the coprocessor is not being used, it is necessary to put the coprocessor in a power-saving mode. The figure above shows all the components of the Intel Xeon Phi coprocessor in a running state. To conserve power, as soon as all the four threads on a core are halted, the clock to the core is gat- ed. Once the clock has been gated for some programmable time, the core power gates itself, thereby eliminating any leakage. At any point, any number of the cores can be powered down or powered up. Additionally, when all the cores are power gated and the uncore detects no activity, the tag directories, the inter- 132 IntelXeonPhiCoprocessor The Architecture Power Management: All On and Running Clock Gate Core
    • connect, L2 caches and the memory controllers are clock gated. At this point, the host driver can put the coprocessor into a deep- er sleep or an idle state, wherein all the uncore is power gated, the GDDR is put into a self-refresh mode, the PCIe logic is put in a wait state for a wakeup and the GDDR-IO is consuming very lit- tle power. These power management techniques help conserve power and make Intel Xeon Phi coprocessor an excellent candi- date for data centers. 133 Power Gate Core Package Auto C3
    • InfiniBand high-speed interconntects InfiniBand–High-Speed Interconnects
    • 135 InfiniBand (IB) is an efficient I/O technology that provides high- speed data transfers and ultra-low latencies for computing and storage over a highly reliable and scalable single fabric. The InfiniBand industry standard ecosystem creates cost effective hardware and software solutions that easily scale from genera- tion to generation. InfiniBand is a high-bandwidth, low-latency network interconnect solution that has grown tremendous market share in the High Performance Computing (HPC) cluster community. InfiniBand was designed to take the place of today’s data center networking tech- nology. In the late 1990s, a group of next generation I/O architects formed an open, community driven network technology to provide scalability and stability based on successes from other network designs. Today, InfiniBand is a popular and widely used I/O fabric among customers within the Top500 supercomputers: major Uni- versities and Labs; Life Sciences; Biomedical; Oil and Gas (Seismic, Reservoir, Modeling Applications); Computer Aided Design and En- gineering; Enterprise Oracle; and Financial Applications. InfiniBand (IB) is an effi- cient I/O technology that provides high-speed data transfers and ultra-low la- tencies for computing and storage over a highly reli- able and scalable single fabric. The InfiniBand in- dustry standard ecosystem creates cost effective hard- ware and software solu- tions that easily scale from generation to generation. InfiniBand is a high-band- width, low-latency network interconnect solution that has grown tremendous market share in the High Performance Computing (HPC) cluster community. InfiniBand was designed to take the place of to- day’s data center network- ing technology. In the late 1990s, a group of next gener- ation I/O architects formed an open, community driven network technology to pro- vide scalability and stabil- ity based on successes from other network designs. To- day, InfiniBand is a popular and widely used I/O fabric among customers within the Top500 supercomput- ers: major Universities and Labs; Life Sciences; Biomed- ical; Oil and Gas (Seismic, Reservoir, Modeling Appli- cations); Computer Aided Design and Engineering; En- terprise Oracle; and Finan- cial Applications.
    • 136 InfiniBand was designed to meet the evolving needs of the high performance computing market. Computational science depends on InfiniBand to deliver: ʎʎ High Bandwidth: Supports host connectivity of 10Gbps with Single Data Rate (SDR), 20Gbps with Double Data Rate (DDR), and 40Gbps with Quad Data Rate (QDR), all while offering an 80Gbps switch for link switching ʎʎ Low Latency: Accelerates the performance of HPC and enter- prise computing applications by providing ultra-low latencies ʎʎ Superior Cluster Scaling: Point-to-point latency re- mains low as node and core counts scale – 1.2 μs. High- est real message per adapter: each PCIe x16 adapter drives 26 million messages per second. Excellent commu- nications/computation overlap among nodes in a cluster ʎʎ High Efficiency: InfiniBand allows reliable protocols like Re- mote Direct Memory Access (RDMA) communication to occur between interconnected hosts, thereby increasing efficiency ʎʎ Fabric Consolidation and Energy Savings: InfiniBand can consolidate networking, clustering, and storage data over a single fabric, which significantly lowers overall power, real estate, and management overhead in data centers. En- hanced Quality of Service (QoS) capabilities support run- ning and managing multiple workloads and traffic classes ʎʎ Data Integrity and Reliability: InfiniBand provides the high- est levels of data integrity by performing Cyclic Redundancy Checks (CRCs) at each fabric hop and end-to-end across the fabric to avoid data corruption. To meet the needs of mission critical applications and high levels of availability, InfiniBand provides fully redundant and lossless I/O fabrics with auto- matic failover path and link layer multi-paths InfiniBand High-Speed Interconnects Intel is a global leader and technology innovator in high perfor- mance networking, including adapters, switches and ASICs.
    • 137 Components of the InfiniBand Fabric InfiniBand is a point-to-point, switched I/O fabric architecture. Point-to-point means that each communication link extends be- tween only two devices. Both devices at each end of a link have full and exclusive access to the communication path. To go beyond a point and traverse the network, switches come into play. By adding switches, multiple points can be interconnected to create a fabric. As more switches are added to a network, aggregated bandwidth of the fabric increases. By adding multiple paths between devices, switches also provide a greater level of redundancy. The InfiniBand fabric has four primary components, which are explained in the following sections:  Host Channel Adapter  Target Channel Adapter  Switch  Subnet Manager Host Channel Adapter This adapter is an interface that resides within a server and communicates directly with the server’s memory and proces- sor as well as the InfiniBand Architecture (IBA) fabric. The adapt- er guarantees delivery of data, performs advanced memory ac- cess, and can recover from transmission errors. Host channel adapters can communicate with a target channel adapter or a switch. A host channel adapter can be a standalone InfiniBand card or it can be integrated on a system motherboard. Intel Tru- eScale InfiniBand host channel adapters outperform the com- petition with the industry’s highest message rate. Combined with the lowest MPI latency and highest effective bandwidth, Intel host channel adapters enable MPI and TCP applications to scale to thousands of nodes with unprecedented price per- formance. Target Channel Adapter This adapter enables I/O devices, such as disk or tape storage, to be located within the network independent of a host computer. Target channel adapters include an I/O controller that is specific to its particular device’s protocol (for example, SCSI, Fibre Chan- nel (FC), or Ethernet). Target channel adapters can communicate with a host channel adapter or a switch. Switch An InfiniBand switch allows many host channel adapters and target channel adapters to connect to it and handles network traffic. The switch looks at the “local route header” on each packet of data that passes through it and forwards it to the appropriate location. The switch is a critical component of the InfiniBand implementation that offers higher availability, higher aggregate bandwidth, load balancing, data mirroring, and much more. A group of switches is referred to as a Fabric. If a host computer is down, the switch still continues to operate. The switch also frees up servers and other devices by handling network traffic. Host Channel Adapter InfiniBand Switch With Subnet Manager Target Channel Adapter figure 1 Typical InfiniBand High Performance Cluster
    • 138 Top 10 Reasons to Use Intel TrueScale InfiniBand 1. Predictable Low Latency Under Load – Less Than 1.0 μs. TrueScale is designed to make the most of multi-core nodes by providing ultra-low latency and significant message rate scalability. As additional compute resources are added to the Intel TrueScale InfiniBand solution, latency and message rates scale linearly. HPC applications can be scaled without having to worry about diminished utilization of compute re- sources. 2. Quad Data Rate (QDR) Performance. The Intel 12000 switch family runs at lane speeds of 10Gbps, providing a full bi- sectional bandwidth of 40Gbps (QDR). In addition, the 12000 switch has the unique capability of riding through periods of congestion with features such as deterministically low la- tency. The Intel family of TrueScale 12000 products offers the lowest latency of any IB switch and high performance trans- fers with the industry’s most robust signal integrity. 3. Flexible QoS Maximizes Bandwidth Use. The Intel 12800 advanced design is based on an architecture that provides comprehensive virtual fabric partitioning capabilities that enable the IB fabric to support the evolving requirements of an organization. 4. Unmatched Scalability – 18 to 864 Ports per Switch. Intel of- fers the broadest portfolio (five chassis and two edge switch- es) from 18 to 864 TrueScale InfiniBand ports, allowing cus- tomers to buy switches that match their connectivity, space, and power requirements. InfiniBand Top 10 Reasons to Use Intel TrueScale InfiniBand “Effective fabric management has become the most important factor in maximizing performance in an HPC cluster. With IFS 7.0, Intel has addressed all of the major fabric management issues in a prod- uct that in many ways goes beyond what others are offering.” Michael Wirth HPC Presales Specialist
    • 139 5. Highly Reliable and Available. Reliability And Service- ability (RAS) that is proven in the most demanding Top500 and Enterprise environments is designed into Intel’s 12000 series with hot swappable components, redundant com- ponents, customer replaceable units, and non-disruptive code load. 6. Lowest Per-Port and Cooling Requirements. The True-Scale 12000 offers the lowest power consumption and the highest port density – 864 total TrueScale InfiniBand ports in a single chassis makes it unmatched in the industry. This results in delivering the lowest power per port for a director switch (7.8 watts per port) and the lowest power per port for an edge switch (3.3 watts per port). 7. Easy to Install and Manage. Intel installation, configuration, and monitoring Wizards reduce time-to-ready. The Intel In- finiBand Fabric Suite (IFS) assists in diagnosing problems in the fabric. Non-disruptive firmware upgrades provide maxi- mum availability and operational simplicity. 8. Protects Existing InfiniBand Investments. Seamless vir- tual I/O integration at the Operating System (OS) and appli- cation levels that match standard network interface cards and adapter semantics with no OS or app changes – they just work. Additionally, the TrueScale family of products is compliant with the InfiniBand Trade Association (IBTA) open specification, so Intel products inter-operate with any IBTA- compliant InfiniBand vendor. Being IBTA-compliant makes the Intel 12000 family of switches ideal for network consoli- dation; sharing and scaling I/O pools across servers; and to pool and share I/O resources between servers. 9. Modular Configuration Flexibility. The Intel 12000 series switches offer configuration and scalability flexibility that meets the requirements of either a high-density or high- performance compute grid by offering port modules that address both needs. Units can be populated with Ultra High Density (UHD) leafs for maximum connectivity or Ultra High Performance (UHP) leafs for maximum performance. The high scalability 24-port leaf modules support configurations between 18 and 864 ports, providing the right size to start and the capability to grow as your grid grows. 10. Option to Gateway to Ethernet and Fibre Channel Net- works. Intel offers multiple options to enable hosts on In- finiBand fabrics to transparently access Fibre Channel based storage area networks (SANs) or Ethernet based local area networks (LANs). Intel TrueScale architecture and the resulting family of products delivers the promise of InfiniBand to the enterprise today.
    • 140 InfiniBand Intel MPI Library 4.0 Performance Introduction Intel’s latest MPI release, Intel MPI 4.0, is now optimized to work with Intel’s TrueScale InfiniBand adapter. Intel MPI 4.0 can now directly call Intel’s TrueScale Performance Scale Messaging (PSM) interface. The PSM interface is designed to optimize MPI applica- tion performance. This means that organizations will be able to achieve a significant performance boost with a combination of Intel MPI 4.0 and Intel TrueScale InfiniBand. Solution Intel worked with Intel to tune and optimize the company’s lat- est MPI release – Intel MPI Library 4.0 – to improve performance when used with Intel TrueScale InfiniBand. With MPI Library 4.0, applications can make full use of High Performance Computing (HPC) hardware, improving the overall performance of the appli- cations on the clusters. Intel MPI Library 4.0 Performance Intel MPI Library 4.0 uses the high performance MPI-2 specifica- tion on multiple fabrics, which results in better performance for applications on Intel architecture-based clusters. This library en- ables quick delivery of maximum end user performance, even if MPI v3.1 MPI v4.0 1400 35% Improvement 1200 1000 800 600 400 200 0 Better Elapsedtime Number of cores 16 32 64 Figure 3 Performance with and without GPUDirect
    • 141 there are changes or upgrades to new interconnects – without requiring major changes to the software or operating environ- ment. This high-performance, message-passing interface library develops applications that can run on multiple cluster fabric in- terconnects chosen by the user at runtime. Testing Intel used the Parallel Unstructured Maritime Aerodynamics (PUMA) Benchmark program to test the performance of Intel MPI Library versions 4.0 and 3.1 with Intel TrueScale InfiniBand. The program analyses internal and external non-reacting compress- ible flows over arbitrarily complex 3D geometries. PUMA is writ- ten in ANSI C, and uses MPI libraries for message passing. Results The test showed that MPI performance can improve by more than 35 percent using Intel MPI Library 4.0 with Intel TrueScale InfiniBand, compared to using Intel MPI Library 3.1. Intel TrueScale InfiniBand Adapters Intel TrueScale InfiniBand Adapters offer scalable performance, reliability, low power, and superior application performance. These adapters ensure superior performance of HPC applica- tions by delivering the highest message rate for multicore compute nodes, the lowest scalable latency, large node count clusters, the highest overall bandwidth on PCI Express Gen1 plat- forms, and superior power efficiency. Purpose: Compare the performance of Intel MPI Library 4.0 and 3.1 with Intel InfiniBand Benchmark: PUMA Flow Cluster: Intel/IBM iDataPlex™ cluster System Configuration: The NET track IBM Q-Blue Cluster/iDataPlex nodes were configured as follows: Processor Intel Xeon® CPU X5570 @ 2,93 GHz Memory 24GB (6x4GB) @ 1333MHz (DDR3) QDR InfiniBand Switch Intel Model 12300/firmware version QDR InfiniBand Host Cnel Adapter Intel QLE7340 software stack Operating System Red Hat® Enterprise Linux® Server release 5.3 Kernel 2.6.18-128.el5 File System IFS Mounted Typical InfiniBand High Performance Cluster
    • 142 TheIntelTrueScale12000familyofMulti-ProtocolFabricDirectors is the most highly integrated cluster computing interconnect so- lution available. An ideal solution HPC, database clustering, and grid utility computing applications, the 12000 Fabric Directors maximize cluster and grid computing interconnect performance whilesimplifyingandreducingthecostofoperatingadatacenter. Subnet Manager The subnet manager is an application responsible for config- uring the local subnet and ensuring its continued operation. Configuration responsibilities include managing switch setup and reconfiguring the subnet if a link goes down or a new one is added. How High Performance Computing Helps Vertical Applications Enterprises that want to do high performance computing must balance the following scalability metrics as the core size and the number of cores per node increase: ʎʎ Latency and Message Rate Scalability: must allow near line- ar growth in productivity as the number of compute cores is scaled. ʎʎ Power and Cooling Efficiency: as the cluster is scaled, power and cooling requirements must not become major concerns in today’s world of energy shortages and high-cost energy. InfiniBand offers the promise of low latency, high bandwidth, and unmatched scalability demanded by high performance computing applications. IB adapters and switches that perform well on these key metrics allow enterprises to meet their high- performance and MPI needs with optimal efficiency. The IB so- lution allows enterprises to quickly achieve their compute and business goals. InfiniBand High-Speed Interconnects
    • 143 Vertical Market Application Segment InfiniBand Value Oil & Gas Mix of Independent Service Provider (ISP) and home grown codes: reservoir modeling Low latency, high bandwidth Computer Aided Engineering (CAE) Mostly Independent Software Vendor (ISV) codes: crash, air flow, and fluid flow simulations High message rate, low latency, scalability Government Home grown codes: labs, defense, weather, and a wide range of apps High message rate, low latency, scalability, high bandwidth Education Home grown and open source codes: a wide range of apps High message rate, low latency, scalability, high bandwidth Financial Mix of ISP and home grown codes: market simulation and trading floor High performance IP, scalability, high bandwidth Life and Materials Science Mostly ISV codes: molecular simulation, computational chemistry, and biology apps Low latency, high message rates
    • 144 InfiniBand Intel Fabric Suite 7 Intel Fabric Suite 7 Maximizing Investments in High Performance Computing Around the world and across all industries, high performance computing (HPC) is used to solve today’s most demanding com- puting problems. As today’s high performance computing chal- lenges grow in complexity and importance, it is vital that the software tools used to install, configure, optimize, and manage HPC fabrics also grow more powerful. Today’s HPC workloads are too large and complex to be managed using software tools that are not focused on the unique needs of HPC. Designed specifically for HPC, Intel Fabric Suite 7 is a complete fabric management solution that maximizes the return on HPC investments, by allowing users to achieve the highest levels of performance, efficiency, and ease of management from In- finiBand-connected HPC clusters of any size. Highlights ʎʎ Intel FastFabric and Fabric Viewer integration with leading third-party HPC cluster management suites ʎʎ Simple but powerful Fabric Viewer dashboard for monitoring fabric performance ʎʎ Intel Fabric Manager integration with leading HPC workload- management suites that combine virtual fabrics and compute ʎʎ Quality of Service (QoS) levels that maximize fabric efficiency and application performance ʎʎ Smart, powerful software tools that make Intel TrueScale Fabric solutions easy to install, configure, verify, optimize, and manage ʎʎ Congestion control architecture that reduces the effects of fabric congestion caused by low credit availability that can result in head-of-line blocking ʎʎ Powerful fabric routing methods—including adaptive and
    • 145 dispersive routing—that optimize traffic flows to avoid or eliminate congestion, maximizing fabric throughput ʎʎ Intel Fabric Manager’s advanced design ensures fabrics of all sizesand topologies—from fat-tree to mesh and torus— scale to support the most demanding HPC workloads Superior Fabric Performance and Simplified Management are Vital for HPC As HPC clusters scale to take advantage of multi-core and GPU- accelerated nodes attached to ever-larger and more complex fabrics, simple but powerful management tools are vital for maximizing return on HPC investments. Intel Fabric Suite 7 provides the performance and management tools for today’s demanding HPC cluster environments. As clus- ters grow larger, management functions, from installation and configuration to fabric verification and optimization, are vital in ensuring that the interconnect fabric can support growing workloads. Besides fabric deployment and monitoring, IFS opti- mizes the performance of message passing applications—from advanced routing algorithms to quality of service (QoS)—that ensure all HPC resources are optimally utilized. Scalable Fabric Performance ʎʎ Purpose-built for HPC, IFS is designed to make HPC clusters faster, easier, and simpler to deploy, manage, and optimize ʎʎ The Intel TrueScale Fabric on-load host architecture exploits the processing power of today’s faster multi-core processors for superior application scaling and performance ʎʎ Policy-driven vFabric virtual fabrics optimize HPC resource utilization by prioritizing and isolating compute and storage traffic flows ʎʎ Advanced fabric routing options—including adaptive and dispersive routing—that distribute traffic across all poten- tial links that improve the overall fabric performance, lower- ing congestion to improve throughput and latency Scalable Fabric Intelligence ʎʎ Routing intelligence scales linearly as Intel TrueScale Fabric switches are added to the fabric ʎʎ Intel Fabric Manager can initialize fabrics having several thousand nodes within seconds ʎʎ Advanced and optimized routing algorithms overcome the limitations of typical subnet managers ʎʎ Smart management tools quickly detect and respond to fab- ric changes, including isolating and correcting problems that can result in unstable fabrics Standards-based Foundation Intel TrueScale Fabric solutions are compliant with all industry software and hardware standards. Intel Fabric Suite (IFS) rede- fines IBTA management by coupling powerful management tools with intuitive user interfaces. InfiniBand fabrics built using IFS deliver the highest levels of fabric performance, efficiency, and management simplicity, allowing users to realize the full benefits from their HPC investments. Major components of Intel Fabric Suite 7 include: ʎʎ FastFabric Toolset ʎʎ Fabric Viewer ʎʎ Fabric Manager Intel FastFabric Toolset Ensures rapid, error-free installation and configuration of Intel True Scale Fabric switch, host, and management software tools. Guided by an intuitive interface, users can easily install, config- ure, validate, and optimize HPC fabrics.
    • 146 Key features include: ʎʎ Automated host software installation and configuration ʎʎ Powerful fabric deployment, verification, analysis, and report- ing tools for measuring connectivity, latency, and bandwidth ʎʎ Automated chassis and switch firmware installation and up- date ʎʎ Fabric route and error analysis tools ʎʎ Benchmarking and tuning tools ʎʎ Easy-to-use fabric health checking tools Intel Fabric Viewer A key component that provides an intuitive, Java-based user interface with a “topology-down” view for fabric status and di- agnostics, with the ability to drill down to the device layer to identify and help correct errors. IFS 7 includes fabric dashboard, a simple and intuitive user interface that presents vital fabric performance statistics. Key features include: ʎʎ Bandwidth and performance monitoring ʎʎ Device and device group level displays ʎʎ Definition of cluster-specific node groups so that displays can be oriented toward end-user node types ʎʎ Support for user-defined hierarchy ʎʎ Easy-to-use virtual fabrics interface Intel Fabric Manager Provides comprehensive control of administrative functions us- ing a commercial-grade subnet manager. With advanced routing algorithms, powerful diagnostic tools, and full subnet manager failover, Fabric Manager simplifies subnet, fabric, and individual component management, making even the largest fabrics easy to deploy and optimize. InfiniBand Intel Fabric Suite 7
    • 147 Key features include: ʎʎ Designed and optimized for large fabric support ʎʎ Integrated with both adaptive and dispersive routing systems ʎʎ Congestion control architecture (CCA) ʎʎ Robust failover of subnet management ʎʎ Path/route management ʎʎ Fabric/chassis management ʎʎ Fabric initialization in seconds, even for very large fabrics ʎʎ Performance and fabric error monitoring Adaptive Routing While most HPC fabrics are configured to have multiple paths between switches, standard InfiniBand switches may not be ca- pable of taking advantage of them to reduce congestion. Adap- tive routing monitors the performance of each possible path, and automatically chooses the least congested route to the destination node. Unlike other implementations that rely on a purely subnet manager-based approach, the intelligent path se- lection capabilities within Fabric Manager, a key part of IFS and Intel 12000 series switches, scales as your fabric grows larger and more complex. Key adaptive routing features include: ʎʎ Highly scalable—adaptive routing intelligence scales as the- fabric grows ʎʎ Hundreds of real-time adjustments per second per switch ʎʎ Topology awareness through Intel Fabric Manager ʎʎ Supports all InfiniBand Trade Association* (IBTA*)-compliant adapters Dispersive Routing One of the critical roles of the fabric management is the initial- ization and configuration of routes through the fabric between a single pair of nodes. Intel Fabric Manager supports a variety of routing methods, including defining alternate routes that disperse traffic flows for redundancy, performance, and load balancing. Instead of sending all packets to a destination on a single path, Intel dispersive routing distributes traffic across all possible paths. Once received, packets are reassembled in their proper order for rapid, efficient processing. By leveraging the entire fabric to deliver maximum communications performance for all jobs, dispersive routing ensures optimal fabric efficiency. Key dispersive routing features include: ʎʎ Fabric routing algorithms that provide the widest separation of paths possible ʎʎ Fabric “hotspot” reductions to avoid fabric congestion ʎʎ Balances traffic across all potential routes ʎʎ May be combined with vFabrics and adaptive routing ʎʎ Very low latency for small messages and time sensitive con- trol protocols ʎʎ Messages can be spread across multiple paths—Intel Perfor- mance Scaled Messaging (PSM) ensures messages are reas- sembled correctly ʎʎ Supports all leading MPI libraries Mesh and Torus Topology Support Fat-tree configurations are the most common topology used in HPC cluster environments today, but other topologies are gaining broader use as organizations try to create increasingly larger, more cost-effective fabrics. Intel is leading the way with full support of emerging mesh and torus topologies that can help reduce networking costs as clusters scale to thousands of nodes. IFS has been enhanced to support these emerging op- tions—from failure handling that maximizes performance, to even higher reliability for complex networking environments.
    • 148 InfiniBand Intel Fabric Suite 7 transtec HPC solutions excel through their easy management and high usability, while maintaining high performance and quality throughout the whole lifetime of the system. As clusters scale, issues like congestion mitigation and Quality-of-Service can make a big difference in whether the fabric performs up to its full potential. With the intelligent choice of Intel InfiniBand products, transtec remains true to combining the best components together to pro- vide the full best-of-breed solution stack to the customer. transtec HPC engineering experts are always available to fine- tune customers’ HPC cluster systems and InfiniBand fabrics to get the maximum performance while at the same time provide them with an easy-to-manage and easy-to-use HPC solution.
    • 149
    • Parstream– BigDataAnalytics
    • 151 The amount of digital data resources has exploded in the past decade. The 2010 IDC Digital Universe Study shows that compared to 2008, the digital universe has grown by 62% or up to 800,000 petabytes (0.8 zettabyte) in 2009. By the year 2020, IDC predicts the amount of data to be 44 timesasbigasitwasin2009, thus reaching approximate- ly 35 zettabytes. As a result of the exploding amount of data, the demand for appli- cations used for searching and analyzing large datas- ets will significantly grow in all industries. However, today’s applica- tions require large and ex- pensiveinfrastructureswhich consume vast amounts of energy and resources, creat- ing challenges in the analysis of mass-data. In the future, increasingly more terabyte- scale datasets will be used for research,analysisanddiagno- sis bringing about further dif- ficulties.
    • 152 Parstream Big Data Analytics What is Big Data The amount of generated and stored data is growing exponen- tially in unprecedented proportions. According to Eric Schmidt, former CEO of Google, the amount of data that is now produced in 2 days is as large as the total amount since the beginning of civilization until 2003. Large amounts of data are produced in many areas of business and private life. For example, there are 200 million Twitter mes- sages generated – daily. Mobile devices, sensor networks, ma- chine-to-machine communications, RFID readers, software logs, cameras and microphones generate a continuous stream of data from multiple terabytes per day. For organizations Big Data means a great opportunity and a big challenge. Companies that know how to manage big data will create more comprehensive predictions for the market and busi- ness. Decision-makers can quickly and safely respond to the de- velopments in dynamic and volatile markets. It is important that companies know how to use Big Data to improve their competitive positions, increase productivity and develop new business models. According to Roger Magoulas of O’Reilly the ability to analyze big data has developed into a core Real Time < 1...10 milli sec 10...100 milli sec 1 sec Terabyte 1...10 sec 1...10 min > 10 min Gigabyte Petabyte Lag Time Big Data Volume Interactive Analytics Operational Data Volumes Batch Analytics (MapReduce) OLTP Reporting In-Memory DB Complex Event Processing
    • 153 competence in the information age and can provide an enor- mous competitive advantage for companies. Big Data analytics has a huge impact on traditional data process- ing and causes business and IT managers to face new problems. The latest IT technologies and processes are poorly suited to analyze very large amounts of data, according to a recent study by Gartner. IT departments often try to meet the increasing flood of infor- mation by traditional means, using the same infrastructure de- velopment and just purchasing more hardware. As data volumes grow stronger than the processing capabilities of additional servers or better processors, new technological approaches are needed. Setbacks in current database architectures Current databases are not engineered for mass data, but rather for small data volumes up to 100 million records. Today’s data- bases use outdated, 20-30 year old architectures and the data and index structures are not constructed for efficient analysis of such data volumes. And, because these databases employ se- quential algorithms, they are not able to exploit the potential of parallel hardware. Algorithmic procedures for indexing large amounts of data have seen relatively few innovations in the past years and decades. Due to the ever-growing amounts of data to be pro- cessed, there are rising challenges that traditional database systems cannot cope with. Currently, new and innovative ap- proaches to solve these problems are developed and evaluat- ed. However, some of these approaches seem to be heading in the wrong direction. Parstream – Big Data Analytics Platform ParStream offers a revolutionary approach to high-performance data analysis. It addresses the problems arising from rapidly in- creasing data volumes in modern business, as well as scientific application scenarios. ParStream ʎʎ has a unique index technology ʎʎ comes with efficient parallel processing ʎʎ is ultra-fast, even with billions of records ʎʎ scales linearly up to petabytes ʎʎ offers real-time analysis and continuous import ʎʎ is cost and energy efficient ParStream uses a unique indexing technology which enables efficient multi-threaded processing on parallel architectures. Hardware and energy costs are substantially reduced while overall performance is optimized to index and execute queries. Close-to-real-time analysis is obtained through simultaneous importing, indexing and querying of data. The developers of Par- Stream strive to continuously improve on the product by work- ing closely with universities, research partners, and clients. ParStream is the right choice when... ʎʎ extreme amounts of data are to be searched and filtered ʎʎ filters use many columns in various combinations (ad-hoc queries) SQL API / JDBC / ODBC Real-Time Analytics Engine High Performance Compressed Index (HPCI) Fast Hybrid Storage (Columnar/ Row) High Speed Loader with Low Latency C++ UDF -API Multi-Dimensional Partitioning Shared Nothing Architecture In-Memory and Disc Technology Massively Parallel Processing (MPP)
    • 154 Technology The technical architecture of the ParStream database can be roughly divided into the following areas: During the extract, transform and load (ETL) process, the supplied data is read and processed so that it can be forwarded to the actual loader. In the next step, ParStream generates and stores necessary index structures and data representations. Input data can be stored in row and/or column-oriented data stores. Here configurable opti- mizations regarding sorting, compression and partitioning are ac- counted for. Once data is loaded into the server, the user can pose queries over a standard interface (SQL) to the database engine. A parser then interprets the queries and a query executor executes it in parallel. The optimizer is used to generate the initial query plan starting from the declarative request. This initial execution plan is used to start processing. However, this initial plan can be altered dy- namically during runtime, e.g., changing the level of parallelism. Parstream Big Data Analytics High Speed Import & Index Data Source Results Data Source Data Source ETL ETL ETL Rowori- ented record store Columnori- ented record store Query Logic Loader Multiple Servers, CPUs and GPUs Compression Partitioning Caching Parallelzation Index Store ParStream ETL Query
    • 155 The query executor also makes optimal use of the available infra- structure. All available parts of the query exploit bitmap information where possible. Logical filter conditions and aggregations can thus be calculated and forwarded to the client by highly efficient bitmap operations. Index Structure ParStream’s innovative index structure enables efficient parallel processing, allowing for unsurpassed levels of performance. Key features of the underlying database engine include: ʎʎ A column-oriented bitmap index using a unique data struc- ture. ʎʎ A data structure that allows allows processing in compressed form. ʎʎ No need for index decompression as in other database and indexing systems. Use of ParStream and its highly efficient query engine yield sig- nificant advantages over regular database systems. ParStream offers: ʎʎ faster index operations, ʎʎ shorter response times, ʎʎ substantially less CPU-load, ʎʎ efficient parallel processing of search and analysis, ʎʎ capability of adding new data to the index during query ex- ecution, close-to-real-time importing and querying of data. Optimized Use of Resources Inside one single processing node, as well as inside single data partitions, query processing can be parallelized to achieve mini- mal response time, by using all available resources (CPU, I/O- Channels). Reliability The reliability of a ParStream database is guaranteed through several product features and fully supports multiple-server en- vironments. The ParStream database allows the usage of an arbi- trary number of servers. Each server can be configured to repli- cate the entire or only parts of the data store. Several load-balancing algorithms will pass a complete query to one of the servers or parts of one query to several, even re- dundant servers. The query can be sent to any of the cluster members, allowing customers to use a load-balancing and fail- over configuration according to their own need, e.g. round robin query distribution. Interfaces The ParStream server can be queried using any of the following three methods: At a high level, we provide a JDBC driver which enables the imple- mentation of a cross platform front end. This allows ParStream to be used with any application capable of using a standard JDBC interface. For example, a Java applet can be built. This ap- plet can be used to query the ParStream server from within any web browser. At mid-level, queries can be submitted as SQL code. Additionally, other descriptive query implementations are available, e.g., a JSON format. This allows the user to define queries, which can- not easily be expressed in standard SQL. The results are then sent back to the client as CSV text or in binary format. At a low level, ParStream’s C++ API and base classes can be used to write user defined query nodes that are stored in dynamic li- braries. A developer can thus integrate his own query nodes into a tree description and register it dynamically into the ParStream
    • 156 server. Such user-defined queries can be executed via a TCP/IP connection and are also integrated into ParStream’s parallel ex- ecution framework. This interface layer allows the formulation of queries that cannot be expressed using SQL. Data Import One of ParStream’s strengths is its ability to import CSV files at unprecedented speeds. This is based on two factors: First of all, the index is much faster in adding data than indexes used in most other databases. Secondly, the importer partitions and sorts the data in parallel, which exploits the capabilities of today’s multi-core processors. Additionally, the import process may run outside the query pro- cess enabling the user to ship the finished data and index files to the servers. In this way, the import’s CPU and I/O load can be separated and moved to different machines. Parstream Big Data Analytics JDBC/ODBC SQL 2003 Parallel Import CSV & Binary Import ETL- Support Socket C++ API SOA Compatibility Front End Map-Reduce RDBMS Raw-Data Real-Time Big Data Analytics Application Tool “As Big Data Analytics is becoming more and more important, new and innovative technology arises, and we are happy that with Parstream, our cus- tomers may experience a performance boost with respect to their analytics work by a factor of up to 1,000.” Matthias Groß HPC Sales Specialist
    • 157 Another remarkable feature of ParStream is its ability to optionally operate on a CSV record store instead of column stores. This accelerates the import because only the indexes need to be written. Plus, since one usually wants to save the original CSV files, no additional hard drive memory is wasted. Supported Platforms Currently ParStream is available on a number of Linux Distri- butions including RedHat Enterprise Linux, Novell Enterprise Linux and Debian Lenny running on X86_64 CPUs. On request, ParStream will be ported to other platforms. Besides the detailed simulation of realistic processes, data an- alytics is is one of the classic application of High Performance Computing. What is new is that with the amount of data in- creasing such that we now speak of Big Data Analytics, and the required time-to-result getting shorter and shorter, specialized solutions arise that apply the scale-out principle of HPC to areas that up to now have been in the focus of High Performance Com- puting. Parstream as a clustered database – optimized for highest read- only parallelized data throughput, next-to-realtime query results – is such a specialized solution. transtec ensures that this latest innovative technology is imple- mented and run on the most reliable systems available, sized exactly according to the customer’s specific requirements. Par- stream as the software technology core together with transtec HPC systems and transtec services combined, provide for the most reliable and comprehensive Big Data Analytics solutions available.
    • 158 ʎʎ complex queries are performed frequently ʎʎ datasets are continuously growing ʎʎ close-to-real-time analytical results are expected ʎʎ infrastructure and operating costs need to be optimized Key business scenarios Big Data is a game changer in all industries and the public sector. ParStream has identified many applications across all sectors that yield high returns for the customer. Examples are ad-spend- ing analytics and social media monitoring in eCommerce, fraud detection and algorithmic trading in Finance/Capital markets, network quality monitoring and customer targeting in Telcos, smart metering and smart grids in Utilities, and many more. ʎʎ Search and Selection – ParStream enables new levels of search and online shopping satisfaction By providing facet- ted search in combination with continuously analyzing cus- tomers preferences and behavior in real-time, ParStream can guide customers very effectively to products and services of their interest leading to increased conversion rate and plat- form revenue. ʎʎ Real-Time Analytics – ParStream drives interactive analytics processes to gain insights faster. Identifying valuable insights from Big Data is ideally an interactive process where hypothe- ses are identified and validated. ParStream delivers results instantaneously on up-to-date Big Data even if many power- users or automated-agents are querying the data at the same time. ParStream enables faster organizational learning for process and product optimization. ʎʎ Online-Processing – ParStream responds automatically to large data streams. For online advertising, re-targeting, trend- spotting, fraud detection and many more cases ParStream can automatically process new data, compare to large histo- ric data sets, and enable react quickly. Fast growing data vo- lumes and increasing algorithmic complexity create a market for faster solutions. Parstream Big Data Analytics eCommerce Services facetted Search Web analytics SEO- analytics Online- Advertising Social Networks Ad serving Profiling Targeting Telco Customer attrition prevention Network monitoring Targeting Prepaid account mgmt Finance Trend analysis Fraud detection Automatic trading Risk analysis Energy Oil and Gas Smart metering Smart grids Wind parks Mining Solar Panels Many More Production Mining M2M Sensors Genetics Intelligence Wather ALL INDUSTRIES MANYAPPLICATIONS
    • 159 Offering ParStream delivers a unique combination of real-time analyt- ics, low-latency continuous import and high throughput for Big Data. ParStream features analysis of billions of records with thousands of columns in sub-second time with very small com- puting infrastructure requirements. ParStream does NOT use cubes to store data, which enables ParStream to analyze data in its full granularity, flexibly in all dimensions and to import new data on the fly. ParStream delivers results typically 1,000 times faster than Post- greSQL or MySQL. Compared to column-based databases, Par- Stream delivers superior performance; this performance differ- ential with traditional technology increases exponentially with increasing data volume and query complexity. ParStream has unique intellectual property in compression and indexing. With High-Performance-Compressed-Index (HPCI) technology, data indices can be analyzed in a compressed form, removing the need to go through the extra step of decompres- sion. This technological advantage results in major performance advantages compared to other solutions, including: ʎʎ Substantially reducing CPU load (previously needed for de- compression, index operations more efficient than full data scanning) ʎʎ Substantially reducing RAM volume (previously needed to store the decompressed index) ʎʎ Providing much faster query response rates because HPCI processing can be parallelized eliminating the delays that re- sult with decompression Compared to state of the art data warehouse solutions, Par- Stream delivers results much faster, with more flexibility and granularity, on more up-to-date data, with lower infrastructure requirements and at lower TCO. ParStream uniquely addresses Big Data scenarios with its ability to continuously import data and make it available to analytics in real-time. This allows for analytics, filtering and conditioning of “live” data streams, for example from social network or financial data feeds. ParStream licensing is based on data volume. The product can be deployed in four ways that also can be combined: ʎʎ As a traditional on premise Software deployment managed by the customer (including Cloud deployments), ʎʎ As an Appliance managed by the customer, ʎʎ As a Cloud Service provided by a ParStream Service partner, and ʎʎ As a Cloud Service provided by ParStream. These options provide flexibility and eases adoption as custom- ers can use ParStream on their preferred infrastructure or cloud, regardless of number of servers, cores, etc. Although private and public cloud-infrastructures are seen as the future model for many applications, dedicated cluster-infrastructures are fre- quently the preferred option for real-time Big Data scenarios. © 2012 by Parstream GmbH Long Query Runtime Query Query Data Big Data Parallel Import Nightly Batch - Import STANDARD DW ARCHITECHTURE PARSTREAM ARCHITECTURE HPCI Frequent Full Table Scans Data is at Least 1 Day Old Each Query Uses Multiple Processor Cores Query execution compressed indices Continuous Import Assures Zimeliness of Data
    • 160 ACML (“AMD Core Math Library“) A software development library released by AMD. This library provides useful mathematical routines optimized for AMD pro- cessors. Originally developed in 2002 for use in high-performance computing (HPC) and scientific computing, ACML allows nearly optimal use of AMD Opteron processors in compute-intensive applications. ACML consists of the following main components: ʎʎ A full implementation of Level 1, 2 and 3 Basic Linear Algebra Subprograms (→ BLAS), with optimizations for AMD Opteron processors. ʎʎ A full suite of Linear Algebra (→ LAPACK) routines. ʎʎ A comprehensive suite of Fast Fourier transform (FFTs) in sin- gle-, double-, single-complex and double-complex data types. ʎʎ Fast scalar, vector, and array math transcendental library routines. ʎʎ Random Number Generators in both single- and double-pre- cision. AMD offers pre-compiled binaries for Linux, Solaris, and Win- dows available for download. Supported compilers include gfor- tran, Intel Fortran Compiler, Microsoft Visual Studio, NAG, PathS- cale, PGI compiler, and Sun Studio. BLAS (“Basic Linear Algebra Subprograms“) Routines that provide standard building blocks for performing basic vector and matrix operations. The Level 1 BLAS perform scalar, vector and vector-vector operations, the Level 2 BLAS perform matrix-vector operations, and the Level 3 BLAS per- form matrix-matrix operations. Because the BLAS are efficient, portable, and widely available, they are commonly used in the development of high quality linear algebra software, e.g. → LA- PACK . Although a model Fortran implementation of the BLAS in available from netlib in the BLAS library, it is not expected to perform as well as a specially tuned implementation on most high-performance computers – on some machines it may give much worse performance – but it allows users to run → LAPACK software on machines that do not offer any other implementa- tion of the BLAS. Cg (“C for Graphics”) A high-level shading language developed by Nvidia in close col- laboration with Microsoft for programming vertex and pixel shaders. It is very similar to Microsoft’s → HLSL. Cg is based on the C programming language and although they share the same syntax, some features of C were modified and new data types were added to make Cg more suitable for programming graphics processing units. This language is only suitable for GPU program- ming and is not a general programming language. The Cg com- piler outputs DirectX or OpenGL shader programs. CISC (“complex instruction-set computer”) A computer instruction set architecture (ISA) in which each in- struction can execute several low-level operations, such as a load from memory, an arithmetic operation, and a memory store, all in a single instruction. The term was retroactively coined in con- trast to reduced instruction set computer (RISC). The terms RISC and CISC have become less meaningful with the continued evo- lution of both CISC and RISC designs and implementations, with modern processors also decoding and splitting more complex in- structions into a series of smaller internal micro-operations that can thereby be executed in a pipelined fashion, thus achieving high performance on a much larger subset of instructions. Glossary
    • 161 cluster Aggregation of several, mostly identical or similar systems to a group, working in parallel on a problem. Previously known as Beowulf Clusters, HPC clusters are composed of commodity hardware, and are scalable in design. The more machines are added to the cluster, the more performance can in principle be achieved. control protocol Part of the → parallel NFS standard CUDA driver API Part of → CUDA CUDA SDK Part of → CUDA CUDA toolkit Part of → CUDA CUDA (“Compute Uniform Device Architecture”) A parallel computing architecture developed by NVIDIA. CUDA is the computing engine in NVIDIA graphics processing units or GPUs that is accessible to software developers through indus- try standard programming languages. Programmers use “C for CUDA” (C with NVIDIA extensions), compiled through a PathScale Open64 C compiler, to code algorithms for execution on the GPU. CUDA architecture supports a range of computational interfaces including → OpenCL and → DirectCompute. Third party wrappers are also available for Python, Fortran, Java and Matlab. CUDA works with all NVIDIA GPUs from the G8X series onwards, includ- ing GeForce, Quadro and the Tesla line. CUDA provides both a low level API and a higher level API. The initial CUDA SDK was made public on 15 February 2007, for Microsoft Windows and Linux. Mac OS X support was later added in version 2.0, which super- sedes the beta released February 14, 2008. CUDA is the hardware and software architecture that enables NVIDIA GPUs to execute programs written with C, C++, Fortran, → OpenCL, → DirectCompute, and other languages. A CUDA pro- gram calls parallel kernels. A kernel executes in parallel across a set of parallel threads. The programmer or compiler organizes these threads in thread blocks and grids of thread blocks. The GPU instantiates a kernel program on a grid of parallel thread blocks. Each thread within a thread block executes an instance of the kernel, and has a thread ID within its thread block, pro- gram counter, registers, per-thread private memory, inputs, and Thread Thread Block Grid 0 Grid 1 per-Thread Private Local Memory per-Block Shared Memory per- Application Context Global Memory ... ...
    • 162 output results. A thread block is a set of concurrently executing threads that can cooperate among themselves through barrier synchro- nization and shared memory. A thread block has a block ID within its grid. A grid is an array of thread blocks that execute the same kernel, read inputs from global memory, write re- sults to global memory, and synchronize between dependent kernel calls. In the CUDA parallel programming model, each thread has a per-thread private memory space used for regis- ter spills, function calls, and C automatic array variables. Each thread block has a per-block shared memory space used for inter-thread communication, data sharing, and result sharing in parallel algorithms. Grids of thread blocks share results in global memory space after kernel-wide global synchroniza- tion. CUDA’s hierarchy of threads maps to a hierarchy of proces- sors on the GPU; a GPU executes one or more kernel grids; a streaming multiprocessor (SM) executes one or more thread blocks; and CUDA cores and other execution units in the SM execute threads. The SM executes threads in groups of 32 threads called a warp. While programmers can generally ig- nore warp execution for functional correctness and think of programming one thread, they can greatly improve perfor- mance by having threads in a warp execute the same code path and access memory in nearby addresses. See the main article “GPU Computing” for further details. DirectCompute An application programming interface (API) that supports gen- eral-purpose computing on graphics processing units (GPUs) on Microsoft Windows Vista or Windows 7. DirectCompute is part of the Microsoft DirectX collection of APIs and was initial- ly released with the DirectX 11 API but runs on both DirectX 10 and DirectX 11 GPUs. The DirectCompute architecture shares a range of computational interfaces with → OpenCL and → CUDA. ETL (“Extract, Transform, Load”) A process in database usage and especially in data warehousing that involves: ʎʎ Extracting data from outside sources ʎʎ Transforming it to fit operational needs (which can include quality levels) ʎʎ Loading it into the end target (database or data warehouse) The first part of an ETL process involves extracting the data from the source systems. In many cases this is the most challenging aspect of ETL, as extracting data correctly will set the stage for how subsequent processes will go. Most data warehousing proj- ects consolidate data from different source systems. Each sepa- rate system may also use a different data organization/format. Common data source formats are relational databases and flat files, but may include non-relational database structures such as Information Management System (IMS) or other data struc- tures such as Virtual Storage Access Method (VSAM) or Indexed Sequential Access Method (ISAM), or even fetching from outside sources such as through web spidering or screen-scraping. The streaming of the extracted data source and load on-the-fly to the destination database is another way of performing ETL when no intermediate data storage is required. In general, the goal of the extraction phase is to convert the data into a single format which is appropriate for transformation processing. An intrinsic part of the extraction involves the parsing of ex- Glossary
    • 163 tracted data, resulting in a check if the data meets an expected pattern or structure. If not, the data may be rejected entirely or in part. The transform stage applies a series of rules or functions to the extracted data from the source to derive the data for loading into the end target. Some data sources will require very little or even no manipulation of data. In other cases, one or more of the following transformation types may be required to meet the business and technical needs of the target database. The load phase loads the data into the end target, usually the data warehouse (DW). Depending on the requirements of the organization, this process varies widely. Some data warehouses may overwrite existing information with cumulative informa- tion, frequently updating extract data is done on daily, weekly or monthly basis. Other DW (or even other parts of the same DW) may add new data in a historicized form, for example, hourly. To understand this, consider a DW that is required to maintain sales records of the last year. Then, the DW will overwrite any data that is older than a year with newer data. However, the entry of data for any one year window will be made in a historicized man- ner. The timing and scope to replace or append are strategic de- sign choices dependent on the time available and the business needs. More complex systems can maintain a history and audit trail of all changes to the data loaded in the DW. As the load phase interacts with a database, the constraints defined in the database schema — as well as in triggers activated upon data load — apply (for example, uniqueness, referential integrity, mandatory fields), which also contribute to the overall data quality performance of the ETL process. FFTW (“Fastest Fourier Transform in the West”) A software library for computing discrete Fourier transforms (DFTs), developed by Matteo Frigo and Steven G. Johnson at the Massachusetts Institute of Technology. FFTW is known as the fastest free software implementation of the Fast Fourier transform (FFT) algorithm (upheld by regular benchmarks). It can compute transforms of real and complex-valued arrays of arbitrary size and dimension in O(n log n) time. floating point standard (IEEE 754) The most widely-used standard for floating-point computation, and is followed by many hardware (CPU and FPU) and software implementations. Many computer languages allow or require that some or all arithmetic be carried out using IEEE 754 for- mats and operations. The current version is IEEE 754-2008, which was published in August 2008; the original IEEE 754-1985 was published in 1985. The standard defines arithmetic for- mats, interchange formats, rounding algorithms, operations, and exception handling. The standard also includes extensive recommendations for advanced exception handling, additional operations (such as trigonometric functions), expression evalu- ation, and for achieving reproducible results. The standard defines single-precision, double-precision, as well as 128-byte quadruple-precision floating point numbers. In the proposed 754r version, the standard also defines the 2-byte half-precision number format. FraunhoferFS (FhGFS) A high-performance parallel file system from the Fraunhofer Competence Center for High Performance Computing. Built on scalable multithreaded core components with native → In-
    • 164 finiBand support, file system nodes can serve → InfiniBand and Ethernet (or any other TCP-enabled network) connections at the same time and automatically switch to a redundant con- nection path in case any of them fails. One of the most funda- mental concepts of FhGFS is the strict avoidance of architectur- al bottle necks. Striping file contents across multiple storage servers is only one part of this concept. Another important as- pect is the distribution of file system metadata (e.g. directory information) across multiple metadata servers. Large systems and metadata intensive applications in general can greatly profit from the latter feature. FhGFS requires no dedicated file system partition on the serv- ers – it uses existing partitions, formatted with any of the stan- dard Linux file systems, e.g. XFS or ext4. For larger networks, it is also possible to create several distinct FhGFS file system parti- tions with different configurations. FhGFS provides a coherent mode, in which it is guaranteed that changes to a file or directo- ry by one client are always immediately visible to other clients. Global Arrays (GA) A library developed by scientists at Pacific Northwest National Laboratory for parallel computing. GA provides a friendly API for shared-memory programming on distributed-memory comput- ers for multidimensional arrays. The GA library is a predecessor to the GAS (global address space) languages currently being de- veloped for high-performance computing. The GA toolkit has ad- ditional libraries including a Memory Allocator (MA), Aggregate Remote Memory Copy Interface (ARMCI), and functionality for out-of-core storage of arrays (ChemIO). Although GA was initially developed to run with TCGMSG, a message passing library that came before the → MPI standard (Message Passing Interface), it is now fully compatible with → MPI. GA includes simple matrix computations (matrix-matrix multiplication, LU solve) and works with → ScaLAPACK. Sparse matrices are available but the imple- mentation is not optimal yet. GA was developed by Jarek Nieplo- cha, Robert Harrison and R. J. Littlefield. The ChemIO library for out-of-core storage was developed by Jarek Nieplocha, Robert Harrison and Ian Foster. The GA library is incorporated into many quantum chemistry packages, including NWChem, MOLPRO, UTChem, MOLCAS, and TURBOMOLE. The GA toolkit is free soft- ware, licensed under a self-made license. Globus Toolkit An open source toolkit for building computing grids developed and provided by the Globus Alliance, currently at version 5. GMP (“GNU Multiple Precision Arithmetic Library”) A free library for arbitrary-precision arithmetic, operating on signed integers, rational numbers, and floating point num- bers. There are no practical limits to the precision except the ones implied by the available memory in the machine GMP runs on (operand dimension limit is 231 bits on 32-bit ma- chines and 237 bits on 64-bit machines). GMP has a rich set of functions, and the functions have a regular interface. The ba- sic interface is for C but wrappers exist for other languages in- cluding C++, C#, OCaml, Perl, PHP, and Python. In the past, the Kaffe Java virtual machine used GMP to support Java built-in arbitrary precision arithmetic. This feature has been removed from recent releases, causing protests from people who claim that they used Kaffe solely for the speed benefits afforded by GMP. As a result, GMP support has been added to GNU Class- path. The main target applications of GMP are cryptography Glossary
    • 165 applications and research, Internet security applications, and computer algebra systems. GotoBLAS Kazushige Goto’s implementation of → BLAS. grid (in CUDA architecture) Part of the → CUDA programming model GridFTP An extension of the standard File Transfer Protocol (FTP) for use with Grid computing. It is defined as part of the → Globus toolkit, under the organisation of the Global Grid Forum (spe- cifically, by the GridFTP working group). The aim of GridFTP is to provide a more reliable and high performance file transfer for Grid computing applications. This is necessary because of the increased demands of transmitting data in Grid comput- ing – it is frequently necessary to transmit very large files, and this needs to be done fast and reliably. GridFTP is the answer to the problem of incompatibility between storage and access systems. Previously, each data provider would make their data available in their own specific way, providing a library of ac- cess functions. This made it difficult to obtain data from multi- ple sources, requiring a different access method for each, and thus dividing the total available data into partitions. GridFTP provides a uniform way of accessing the data, encompassing functions from all the different modes of access, building on and extending the universally accepted FTP standard. FTP was chosen as a basis for it because of its widespread use, and be- cause it has a well defined architecture for extensions to the protocol (which may be dynamically discovered). Hierarchical Data Format (HDF) A set of file formats and libraries designed to store and organize large amounts of numerical data. Originally developed at the National Center for Supercomputing Applications, it is currently supported by the non-profit HDF Group, whose mission is to en- sure continued development of HDF5 technologies, and the con- tinued accessibility of data currently stored in HDF. In keeping with this goal, the HDF format, libraries and associated tools are available under a liberal, BSD-like license for general use. HDF is supported by many commercial and non-commercial software platforms, including Java, MATLAB, IDL, and Python. The freely available HDF distribution consists of the library, command-line utilities, test suite source, Java interface, and the Java-based HDF Viewer (HDFView). There currently exist two major versions of HDF, HDF4 and HDF5, which differ significantly in design and API. HLSL (“High Level Shader Language“) The High Level Shader Language or High Level Shading Language (HLSL) is a proprietary shading language developed by Microsoft for use with the Microsoft Direct3D API. It is analogous to the GLSL shading language used with the OpenGL standard. It is very similar to the NVIDIA Cg shading language, as it was developed alongside it. HLSL programs come in three forms, vertex shaders, geometry shaders, and pixel (or fragment) shaders. A vertex shader is ex- ecuted for each vertex that is submitted by the application, and is primarily responsible for transforming the vertex from object space to view space, generating texture coordinates, and calcu- lating lighting coefficients such as the vertex’s tangent, binor- mal and normal vectors. When a group of vertices (normally 3, to form a triangle) come through the vertex shader, their output po-
    • 166 sition is interpolated to form pixels within its area; this process is known as rasterisation. Each of these pixels comes through the pixel shader, whereby the resultant screen colour is calculat- ed. Optionally, an application using a Direct3D10 interface and Direct3D10 hardware may also specify a geometry shader. This shader takes as its input the three vertices of a triangle and uses this data to generate (or tessellate) additional triangles, which are each then sent to the rasterizer. InfiniBand Switched fabric communications link primarily used in HPC and enterprise data centers. Its features include high throughput, low latency, quality of service and failover, and it is designed to be scalable. The InfiniBand architecture specification defines a connection between processor nodes and high performance I/O nodes such as storage devices. Like → PCI Express, and many other modern interconnects, InfiniBand offers point-to-point bidirec- tional serial links intended for the connection of processors with high-speed peripherals such as disks. On top of the point to point capabilities, InfiniBand also offers multicast operations as well. It supports several signalling rates and, as with PCI Express, links can be bonded together for additional throughput. The SDR serial connection’s signalling rate is 2.5 gigabit per sec- ond (Gbit/s) in each direction per connection. DDR is 5 Gbit/s and QDR is 10 Gbit/s. FDR is 14.0625 Gbit/s and EDR is 25.78125 Gbit/s per lane. For SDR, DDR and QDR, links use 8B/10B encoding – every 10 bits sent carry 8 bits of data – making the useful data transmis- sion rate four-fifths the raw rate. Thus single, double, and quad data rates carry 2, 4, or 8 Gbit/s useful data, respectively. For FDR and EDR, links use 64B/66B encoding – every 66 bits sent carry 64 bits of data. Implementers can aggregate links in units of 4 or 12, called 4X or 12X. A 12X QDR link therefore carries 120 Gbit/s raw, or 96 Gbit/s of useful data. As of 2009 most systems use a 4X aggregate, imply- ing a 10 Gbit/s (SDR), 20 Gbit/s (DDR) or 40 Gbit/s (QDR) connection. Larger systems with 12X links are typically used for cluster and supercomputer interconnects and for inter-switch connections. The single data rate switch chips have a latency of 200 nano- seconds, DDR switch chips have a latency of 140 nanoseconds and QDR switch chips have a latency of 100 nanoseconds. The end-to-end latency range ranges from 1.07 microseconds MPI la- tency to 1.29 microseconds MPI latency to 2.6 microseconds. As of 2009 various InfiniBand host channel adapters (HCA) exist in the market, each with different latency and bandwidth charac- teristics. InfiniBand also provides RDMA capabilities for low CPU overhead. The latency for RDMA operations is less than 1 micro- second. See the main article “InfiniBand” for further description of In- finiBand features Intel Integrated Performance Primitives (Intel IPP) A multi-threaded software library of functions for multimedia and data processing applications, produced by Intel. The library supports Intel and compatible processors and is available for Windows, Linux, and Mac OS X operating systems. It is available separately or as a part of Intel Parallel Studio. The library takes advantage of processor features including MMX, SSE, SSE2, SSE3, SSSE3, SSE4, AES-NI and multicore processors. Intel IPP is divided into four major processing groups: Signal (with linear array or vector data), Image (with 2D arrays for typical color spaces), Ma- trix (with n x m arrays for matrix operations), and Cryptography. Glossary
    • 167 Half the entry points are of the matrix type, a third are of the signal type and the remainder are of the image and cryptogra- phy types. Intel IPP functions are divided into 4 data types: Data types include 8u (8-bit unsigned), 8s (8-bit signed), 16s, 32f (32-bit floating-point), 64f, etc. Typically, an application developer works with only one dominant data type for most processing func- tions, converting between input to processing to output formats at the end points. Version 5.2 was introduced June 5, 2007, add- ing code samples for data compression, new video codec sup- port, support for 64-bit applications on Mac OS X, support for Windows Vista, and new functions for ray-tracing and rendering. Version 6.1 was released with the Intel C++ Compiler on June 28, 2009 and Update 1 for version 6.1 was released on July 28, 2009. Intel Threading Building Blocks (TBB) A C++ template library developed by Intel Corporation for writing software programs that take advantage of multi-core processors. The library consists of data structures and algo- rithms that allow a programmer to avoid some complications arising from the use of native threading packages such as POSIX → threads, Windows → threads, or the portable Boost Threads in which individual → threads of execution are cre- ated, synchronized, and terminated manually. Instead the li- brary abstracts access to the multiple processors by allowing the operations to be treated as “tasks”, which are allocated to individual cores dynamically by the library’s run-time en- gine, and by automating efficient use of the CPU cache. A TBB program creates, synchronizes and destroys graphs of depen- dent tasks according to algorithms, i.e. high-level parallel pro- gramming paradigms (a.k.a. Algorithmic Skeletons). Tasks are then executed respecting graph dependencies. This approach groups TBB in a family of solutions for parallel programming aiming to decouple the programming from the particulars of the underlying machine. Intel TBB is available commercially as a binary distribution with support and in open source in both source and binary forms. Version 4.0 was introduced on September 8, 2011. iSER (“iSCSI Extensions for RDMA“) A protocol that maps the iSCSI protocol over a network that pro- vides RDMA services (like → iWARP or → InfiniBand). This permits data to be transferred directly into SCSI I/O buffers without in- termediate data copies. The Datamover Architecture (DA) defines an abstract model in which the movement of data between iSCSI end nodes is logically separated from the rest of the iSCSI proto- col. iSER is one Datamover protocol. The interface between the iSCSI and a Datamover protocol, iSER in this case, is called Dat- amover Interface (DI). SDR Single Data Rate DDR Double Data Rate QDR Quadruple Data Rate FDR Fourteen Data Rate EDR Enhanced Data Rate 1X 2 Gbit/s 4 Gbit/s 8 Gbit/s 14 Gbit/s 25 Gbit/s 4X 8 Gbit/s 16 Gbit/s 32 Gbit/s 56 Gbit/s 100 Gbit/s 12X 24 Gbit/s 48 Gbit/s 96 Gbit/s 168 Gbit/s 300 Gbit/s
    • 168 iWARP (“Internet Wide Area RDMA Protocol”) An Internet Engineering Task Force (IETF) update of the RDMA Consortium’s → RDMA over TCP standard. This later standard is zero-copy transmission over legacy TCP. Because a kernel imple- mentation of the TCP stack is a tremendous bottleneck, a few vendors now implement TCP in hardware. This additional hard- ware is known as the TCP offload engine (TOE). TOE itself does not prevent copying on the receive side, and must be combined with RDMA hardware for zero-copy results. The main component is the Data Direct Protocol (DDP), which permits the actual zero- copy transmission. The transmission itself is not performed by DDP, but by TCP. kernel (in CUDA architecture) Part of the → CUDA programming model LAM/MPI A high-quality open-source implementation of the → MPI specifi- cation, including all of MPI-1.2 and much of MPI-2. Superseded by the → OpenMPI implementation- LAPACK (“linear algebra package”) Routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The original goal of the LA- PACK project was to make the widely used EISPACK and → LINPACK libraries run efficiently on shared-memory vector and parallel pro- cessors. LAPACK routines are written so that as much as possible of the computation is performed by calls to the → BLAS library. While → LINPACK and EISPACK are based on the vector operation kernels of the Level 1 BLAS, LAPACK was designed at the outset to exploit the Level 3 BLAS. Highly efficient machine-specific imple- mentations of the BLAS are available for many modern high-per- formance computers. The BLAS enable LAPACK routines to achieve high performance with portable software. layout Part of the → parallel NFS standard. Currently three types of lay- out exist: file-based, block/volume-based, and object-based, the latter making use of → object-based storage devices LINPACK A collection of Fortran subroutines that analyze and solve linear equations and linear least-squares problems. LINPACK was de- signed for supercomputers in use in the 1970s and early 1980s. LINPACK has been largely superseded by → LAPACK, which has been designed to run efficiently on shared-memory, vector su- percomputers. LINPACK makes use of the → BLAS libraries for performing basic vector and matrix operations. The LINPACK benchmarks are a measure of a system‘s float- ing point computing power and measure how fast a computer solves a dense N by N system of linear equations Ax=b, which is a common task in engineering. The solution is obtained by Gauss- ian elimination with partial pivoting, with 2/3•N³ + 2•N² floating point operations. The result is reported in millions of floating point operations per second (MFLOP/s, sometimes simply called MFLOPS). LNET Communication protocol in → Lustre Glossary
    • 169 logical object volume (LOV) A logical entity in → Lustre Lustre An object-based → parallel file system management server (MGS) A functional component in → Lustre MapReduce A framework for processing highly distributable problems across huge datasets using a large number of computers (nodes), collec- tively referred to as a cluster (if all nodes use the same hardware) or a grid (if the nodes use different hardware). Computational processing can occur on data stored either in a filesystem (un- structured) or in a database (structured). “Map” step: The master node takes the input, divides it into smaller sub-problems, and distributes them to worker nodes. A worker node may do this again in turn, leading to a multi-level tree structure. The worker node processes the smaller problem, and passes the answer back to its master node. “Reduce” step: The master node then collects the answers to all the sub-problems and combines them in some way to form the output – the answer to the problem it was originally trying to solve. MapReduce allows for distributed processing of the map and reduction operations. Provided each mapping operation is independent of the others, all maps can be performed in par- allel – though in practice it is limited by the number of inde- pendent data sources and/or the number of CPUs near each source. Similarly, a set of ‘reducers’ can perform the reduction phase - provided all outputs of the map operation that share the same key are presented to the same reducer at the same time. While this process can often appear inefficient com- pared to algorithms that are more sequential, MapReduce can be applied to significantly larger datasets than “commodity” servers can handle – a large server farm can use MapReduce to sort a petabyte of data in only a few hours. The parallelism also offers some possibility of recovering from partial failure of servers or storage during the operation: if one mapper or reducer fails, the work can be rescheduled – assuming the in- put data is still available. metadata server (MDS) A functional component in → Lustre metadata target (MDT) A logical entity in → Lustre MKL (“Math Kernel Library”) A library of optimized, math routines for science, engineering, and financial applications developed by Intel. Core math func- tions include → BLAS, → LAPACK, → ScaLAPACK, Sparse Solvers, Fast Fourier Transforms, and Vector Math. The library supports Intel and compatible processors and is available for Windows, Linux and Mac OS X operating systems. MPI, MPI-2 (“message-passing interface”) A language-independent communications protocol used to program parallel computers. Both point-to-point and collec- tive communication are supported. MPI remains the domi- nant model used in high-performance computing today. There
    • 170 are two versions of the standard that are currently popular: version 1.2 (shortly called MPI-1), which emphasizes message passing and has a static runtime environment, and MPI-2.1 (MPI-2), which includes new features such as parallel I/O, dy- namic process management and remote memory operations. MPI-2 specifies over 500 functions and provides language bindings for ANSI C, ANSI Fortran (Fortran90), and ANSI C++. In- teroperability of objects defined in MPI was also added to allow for easier mixed-language message passing programming. A side effect of MPI-2 standardization (completed in 1996) was clarifi- cation of the MPI-1 standard, creating the MPI-1.2 level. MPI-2 is mostly a superset of MPI-1, although some functions have been deprecated. Thus MPI-1.2 programs still work under MPI imple- mentations compliant with the MPI-2 standard. The MPI Forum reconvened in 2007, to clarify some MPI-2 issues and explore de- velopments for a possible MPI-3. MPICH2 A freely available, portable → MPI 2.0 implementation, main- tained by Argonne National Laboratory MPP (“massively parallel processing”) So-called MPP jobs are computer programs with several parts running on several machines in parallel, often calculating simu- lation problems. The communication between these parts can e.g. be realized by the → MPI software interface. MS-MPI Microsoft → MPI 2.0 implementation shipped with Microsoft HPC Pack 2008 SDK, based on and designed for maximum compatibil- ity with the → MPICH2 reference implementation. MVAPICH2 An → MPI 2.0 implementation based on → MPICH2 and developed by the Department of Computer Science and Engineering at Ohio State University. It is available under BSD licensing and sup- ports MPI over InfiniBand, 10GigE/iWARP and RDMAoE. NetCDF (“Network Common Data Form”) A set of software libraries and self-describing, machine-indepen- dent data formats that support the creation, access, and shar- ing of array-oriented scientific data. The project homepage is hosted by the Unidata program at the University Corporation for Atmospheric Research (UCAR). They are also the chief source of NetCDF software, standards development, updates, etc. The format is an open standard. NetCDF Classic and 64-bit Offset For- mat are an international standard of the Open Geospatial Con- sortium. The project is actively supported by UCAR. The recently released (2008) version 4.0 greatly enhances the data model by allowing the use of the → HDF5 data file format. Version 4.1 (2010) adds support for C and Fortran client access to specified subsets of remote data via OPeNDAP. The format was originally based on the conceptual model of the NASA CDF but has since diverged and is not compatible with it. It is commonly used in climatology, meteorology and oceanography applications (e.g., weather fore- casting, climate change) and GIS applications. It is an input/out- put format for many GIS applications, and for general scientific data exchange. The NetCDF C library, and the libraries based on it (Fortran 77 and Fortran 90, C++, and all third-party libraries) can, starting with version 4.1.1, read some data in other data formats. Data in the → HDF5 format can be read, with some restrictions. Data in the → HDF4 format can be read by the NetCDF C library if created using the → HDF4 Scientific Data (SD) API. Glossary
    • 171 NetworkDirect A remote direct memory access (RDMA)-based network in- terface implemented in Windows Server 2008 and later. Net- workDirect uses a more direct path from → MPI applications to networking hardware, resulting in very fast and efficient networking. See the main article “Windows HPC Server 2008 R2” for further details. NFS (Network File System) A network file system protocol originally developed by Sun Microsystems in 1984, allowing a user on a client computer to access files over a network in a manner similar to how local storage is accessed. NFS, like many other protocols, builds on the Open Network Computing Remote Procedure Call (ONC RPC) system. The Network File System is an open standard defined in RFCs, allowing anyone to implement the protocol. Sun used version 1 only for in-house experimental purposes. When the development team added substantial changes to NFS version 1 and released it outside of Sun, they decided to release the new version as V2, so that version interoperation and RPC version fallback could be tested. Version 2 of the pro- tocol (defined in RFC 1094, March 1989) originally operated entirely over UDP. Its designers meant to keep the protocol stateless, with locking (for example) implemented outside of the core protocol Version 3 (RFC 1813, June 1995) added: ʎʎ support for 64-bit file sizes and offsets, to handle files larger than 2 gigabytes (GB) ʎʎ support for asynchronous writes on the server, to improve write performance ʎʎ additional file attributes in many replies, to avoid the need to re-fetch them ʎʎ a READDIRPLUS operation, to get file handles and attributes along with file names when scanning a directory ʎʎ assorted other improvements At the time of introduction of Version 3, vendor support for TCP as a transport-layer protocol began increasing. While several vendors had already added support for NFS Version 2 with TCP as a transport, Sun Microsystems added support for TCP as a transport for NFS at the same time it added support for Ver- sion 3. Using TCP as a transport made using NFS over a WAN more feasible. Version 4 (RFC 3010, December 2000; revised in RFC 3530, April 2003), influenced by AFS and CIFS, includes performance im- provements, mandates strong security, and introduces a state- ful protocol. Version 4 became the first version developed with the Internet Engineering Task Force (IETF) after Sun Microsys- tems handed over the development of the NFS protocols. NFS version 4 minor version 1 (NFSv 4.1) has been approved by the IESG and received an RFC number since Jan 2010. The NFSv 4.1 specification aims: to provide protocol support to take advantage of clustered server deployments including the ability to provide scalable parallel access to files distributed among multiple serv- ers. NFSv 4.1 adds the parallel NFS (pNFS) capability, which enables data access parallelism. The NFSv 4.1 protocol defines a method of separating the filesystem meta-data from the location of the file data; it goes beyond the simple name/data separation by strip- ing the data amongst a set of data servers. This is different from the traditional NFS server which holds the names of files and their data under the single umbrella of the server.
    • 172 In addition to pNFS, NFSv 4.1 provides sessions, directory delega- tion and notifications, multi-server namespace, access control lists (ACL/SACL/DACL), retention attributions, and SECINFO_NO_ NAME. See the main article “Parallel Filesystems” for further de- tails. Current work is being done in preparing a draft for a future ver- sion 4.2 of the NFS standard, including so-called federated file- systems, which constitute the NFS counterpart of Microsoft’s distributed filesystem (DFS). NUMA (“non-uniform memory access”) A computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors. object storage server (OSS) A functional component in → Lustre object storage target (OST) A logical entity in → Lustre object-based storage device (OSD) An intelligent evolution of disk drives that can store and serve objects rather than simply place data on tracks and sectors. This task is accomplished by moving low-level storage functions into the storage device and accessing the device through an object interface. Unlike a traditional block-oriented device providing ac- cess to data organized as an array of unrelated blocks, an object Glossary store allows access to data by means of storage objects. A storage object is a virtual entity that groups data together that has been determined by the user to be logically related. Space for a storage object is allocated internally by the OSD itself instead of by a host- based file system. OSDs manage all necessary low-level storage, space management, and security functions. Because there is no host-based metadata for an object (such as inode information), the only way for an application to retrieve an object is by using its object identifier (OID). The SCSI interface was modified and ex- tended by the OSD Technical Work Group of the Storage Network- ing Industry Association (SNIA) with varied industry and academic contributors, resulting in a draft standard to T10 in 2004. This stan- dard was ratified in September 2004 and became the ANSI T10 SCSI OSD V1 command set, released as INCITS 400-2004. The SNIA group continues to work on further extensions to the interface, such as the ANSI T10 SCSI OSD V2 command set. OLAP cube (“Online Analytical Processing”) A set of data, organized in a way that facilitates non-predeter- mined queries for aggregated information, or in other words, online analytical processing. OLAP is one of the computer-based techniques for analyzing business data that are collectively called business intelligence. OLAP cubes can be thought of as ex- tensions to the two-dimensional array of a spreadsheet. For ex- ample a company might wish to analyze some financial data by product, by time-period, by city, by type of revenue and cost, and by comparing actual data with a budget. These additional meth- ods of analyzing the data are known as dimensions. Because there can be more than three dimensions in an OLAP system the term hypercube is sometimes used.
    • PCIe 1.x PCIe 2.x PCIe 3.0 PCIe 4.0 x1 256 MB/s 512 MB/s 1 GB/s 2 GB/s x2 512 MB/s 1 GB/s 2 GB/s 4 GB/s x4 1 GB/s 2 GB/s 4 GB/s 8 GB/s x8 2 GB/s 4 GB/s 8 GB/s 16 GB/s x16 4 GB/s 8 GB/s 16 GB/s 32 GB/s x32 8 GB/s 16 GB/s 32 GB/s 64 GB/s 173 OpenCL (“Open Computing Language”) A framework for writing programs that execute across heteroge- neous platforms consisting of CPUs, GPUs, and other processors. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the platforms. OpenCL provides parallel computing using task-based and data-based parallelism. OpenCL is analogous to the open industry standards OpenGL and OpenAL, for 3D graphics and computer audio, respectively. Originally developed by Apple Inc., which holds trademark rights, OpenCL is now managed by the non-profit technology consor- tium Khronos Group. OpenMP (“Open Multi-Processing”) An application programming interface (API) that supports multi- platform shared memory multiprocessing programming in C, C++ and Fortran on many architectures, including Unix and Mi- crosoft Windows platforms. It consists of a set of compiler direc- tives, library routines, and environment variables that influence run-time behavior. Jointly defined by a group of major computer hardware and soft- ware vendors, OpenMP is a portable, scalable model that gives programmers a simple and flexible interface for developing par- allel applications for platforms ranging from the desktop to the supercomputer. An application built with the hybrid model of parallel program- ming can run on a computer cluster using both OpenMP and Message Passing Interface (MPI), or more transparently through the use of OpenMP extensions for non-shared memory systems. OpenMPI An open source → MPI-2 implementation that is developed and maintained by a consortium of academic, research, and industry partners. parallel NFS (pNFS) A → parallel file system standard, optional part of the current → NFS standard 4.1. See the main article “Parallel Filesystems” for further details. PCI Express (PCIe) A computer expansion card standard designed to replace the older PCI, PCI-X, and AGP standards. Introduced by Intel in 2004, PCIe (or PCI-E, as it is commonly called) is the latest standard for expansion cards that is available on mainstream comput- ers. PCIe, unlike previous PC expansion standards, is structured around point-to-point serial links, a pair of which (one in each direction) make up lanes; rather than a shared parallel bus. These lanes are routed by a hub on the main-board acting as a crossbar switch. This dynamic point-to-point behavior allows
    • 174 more than one pair of devices to communicate with each other at the same time. In contrast, older PC interfaces had all devices permanently wired to the same bus; therefore, only one device could send information at a time. This format also allows “chan- nel grouping”, where multiple lanes are bonded to a single de- vice pair in order to provide higher bandwidth. The number of lanes is “negotiated” during power-up or explicitly during op- eration. By making the lane count flexible a single standard can provide for the needs of high-bandwidth cards (e.g. graphics cards, 10 Gigabit Ethernet cards and multiport Gigabit Ethernet cards) while also being economical for less demanding cards. Unlike preceding PC expansion interface standards, PCIe is a net- work of point-to-point connections. This removes the need for “arbitrating” the bus or waiting for the bus to be free and allows for full duplex communications. This means that while standard PCI-X (133 MHz 64 bit) and PCIe x4 have roughly the same data transfer rate, PCIe x4 will give better performance if multiple device pairs are communicating simultaneously or if communi- cation within a single device pair is bidirectional. Specifications of the format are maintained and developed by a group of more than 900 industry-leading companies called the PCI-SIG (PCI Spe- cial Interest Group). In PCIe 1.x, each lane carries approximately 250 MB/s. PCIe 2.0, released in late 2007, adds a Gen2-signalling mode, doubling the rate to about 500 MB/s. On November 18, 2010, the PCI Special Interest Group officially publishes the final- ized PCI Express 3.0 specification to its members to build devic- es based on this new version of PCI Express, which allows for a Gen3-signalling mode at 1 GB/s. On November 29, 2011, PCI-SIG has annonced to proceed to PCI Express 4.0 featuring 16 GT/s, still on copper technology. Addi- Glossary tionally, active and idle power optimizations are to be investigat- ed. Final specifications are expected to be released in 2014/15. PETSc (“Portable, Extensible Toolkit for Scientific Computation”) A suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differen- tial equations. It employs the → Message Passing Interface (MPI) standard for all message-passing communication. The current version of PETSc is 3.2; released September 8, 2011. PETSc is in- tended for use in large-scale application projects, many ongoing computational science projects are built around the PETSc li- braries. Its careful design allows advanced users to have detailed control over the solution process. PETSc includes a large suite of parallel linear and nonlinear equation solvers that are eas- ily used in application codes written in C, C++, Fortran and now Python. PETSc provides many of the mechanisms needed within parallel application code, such as simple parallel matrix and vec- tor assembly routines that allow the overlap of communication and computation. In addition, PETSc includes support for paral- lel distributed arrays useful for finite difference methods. process → thread PTX (“parallel thread execution”) Parallel Thread Execution (PTX) is a pseudo-assembly language used in NVIDIA’s CUDA programming environment. The ‘nvcc’ compiler translates code written in CUDA, a C-like language, into PTX, and the graphics driver contains a compiler which trans- lates the PTX into something which can be run on the processing cores.
    • 175 RDMA (“remote direct memory access”) Allows data to move directly from the memory of one computer into that of another without involving either one‘s operating system. This permits high-throughput, low-latency networking, which is especially useful in massively parallel computer clus- ters. RDMA relies on a special philosophy in using DMA. RDMA supports zero-copy networking by enabling the network adapt- er to transfer data directly to or from application memory, elim- inating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. When an application performs an RDMA Read or Write request, the application data is delivered directly to the network, reduc- ing latency and enabling fast message transfer. Common RDMA implementations include → InfiniBand, → iSER, and → iWARP. RISC (“reduced instruction-set computer”) A CPU design strategy emphasizing the insight that simplified instructions that “do less“ may still provide for higher perfor- mance if this simplicity can be utilized to make instructions ex- ecute very quickly → CISC. ScaLAPACK (“scalable LAPACK”) Library including a subset of → LAPACK routines redesigned for distributed memory MIMD (multiple instruction, multiple data) parallel computers. It is currently written in a Single- Program-Multiple-Data style using explicit message passing for interprocessor communication. ScaLAPACK is designed for heterogeneous computing and is portable on any com- puter that supports → MPI. The fundamental building blocks of the ScaLAPACK library are distributed memory versions (PBLAS) of the Level 1, 2 and 3 → BLAS, and a set of Basic Linear Algebra Communication Subprograms (BLACS) for communi- cation tasks that arise frequently in parallel linear algebra computations. In the ScaLAPACK routines, all interprocessor communication occurs within the PBLAS and the BLACS. One of the design goals of ScaLAPACK was to have the ScaLAPACK routines resemble their → LAPACK equivalents as much as possible. service-oriented architecture (SOA) An approach to building distributed, loosely coupled applica- tions in which functions are separated into distinct services that can be distributed over a network, combined, and reused. See the main article “Windows HPC Server 2008 R2” for further details. single precision/double precision → floating point standard SMP (“shared memory processing”) So-called SMP jobs are computer programs with several parts running on the same system and accessing a shared memory region. A usual implementation of SMP jobs is → multi-threaded programs. The communication between the single threads can e.g. be realized by the → OpenMP software interface standard, but also in a non-standard way by means of native UNIX interprocess communication mechanisms. SMP (“symmetric multiprocessing”) A multiprocessor or multicore computer architecture where two or more identical processors or cores can connect to a
    • 176 Glossary single shared main memory in a completely symmetric way, i.e. each part of the main memory has the same distance to each of the cores. Opposite: → NUMA storage access protocol Part of the → parallel NFS standard STREAM A simple synthetic benchmark program that measures sustain- able memory bandwidth (in MB/s) and the corresponding com- putation rate for simple vector kernels. streaming multiprocessor (SM) Hardware component within the → Tesla GPU series subnet manager Application responsible for configuring the local → InfiniBand subnet and ensuring its continued operation. superscalar processors A superscalar CPU architecture implements a form of parallel- ism called instruction-level parallelism within a single proces- sor. It thereby allows faster CPU throughput than would other- wise be possible at the same clock rate. A superscalar processor executes more than one instruction during a clock cycle by si- multaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a multi- plier. Tesla NVIDIA‘s third brand of GPUs, based on high-end GPUs from the G80 and on. Tesla is NVIDIA‘s first dedicated General Pur- pose GPU. Because of the very high computational power (measured in floating point operations per second or FLOPS) compared to recent microprocessors, the Tesla products are intended for the HPC market. The primary function of Tesla products are to aid in simulations, large scale calculations (es- pecially floating-point calculations), and image generation for professional and scientific fields, with the use of → CUDA. See the main article “NVIDIA GPU Computing” for further details. thread A thread of execution is a fork of a computer program into two or moreconcurrentlyrunningtasks.Theimplementationofthreads and processes differs from one operating system to another, but in most cases, a thread is contained inside a process. On a single processor, multithreading generally occurs by multitasking: the processor switches between different threads. On a multiproces- sor or multi-core system, the threads or tasks will generally run at the same time, with each processor or core running a particu- lar thread or task. Threads are distinguished from processes in that processes are typically independent, while threads exist as subsets of a process. Whereas processes have separate address spaces, threads share their address space, which makes inter- thread communication much easier than classical inter-process communication (IPC). thread (in CUDA architecture) Part of the → CUDA programming model
    • 177 thread block (in CUDA architecture) Part of the → CUDA programming model thread processor array (TPA) Hardware component within the → Tesla GPU series 10 Gigabit Ethernet The fastest of the Ethernet standards, first published in 2002 as IEEE Std 802.3ae-2002. It defines a version of Ethernet with a nominal data rate of 10 Gbit/s, ten times as fast as Gigabit Eth- ernet. Over the years several 802.3 standards relating to 10GbE have been published, which later were consolidated into the IEEE 802.3-2005 standard. IEEE 802.3-2005 and the other amend- ments have been consolidated into IEEE Std 802.3-2008. 10 Gigabit Ethernet supports only full duplex links which can be connected by switches. Half Duplex operation and CSMA/CD (carrier sense multiple access with collision detect) are not supported in 10GbE. The 10 Gigabit Ethernet standard encom- passes a number of different physical layer (PHY) standards. As of 2008 10 Gigabit Ethernet is still an emerging technology with only 1 million ports shipped in 2007, and it remains to be seen which of the PHYs will gain widespread commercial ac- ceptance. warp (in CUDA architecture) Part of the → CUDA programming modelOLAP cube
    • Notes
    • Notes
    • Notes
    • I am interested in: P compute nodes and other hardware P cluster and workload management P HPC-related services P GPU computing P parallel filesystems and HPC storage P InfiniBand solutions P remote postprocessing and visualization P HPC databases and Big Data Analytics P YES: I would like more information about transtec’s HPC solutions. P YES: I would like a transtec HPC specialist to contact me. Company: Department: Name: Address: ZIP, City: Email: Please fill in the postcard on the right, or send an email to hpc@transtec.de for further information.
    • transtecAG Waldhoernlestrasse18 72072Tuebingen Germany
    • Simulation Simulation CAD CAD CAE CAE Engineering Engineering Price Modelling Price Modelling Life Sciences Life Sciences Automotive Automotive Aerospace Aerospace Risk Analysis Risk Analysis Big Data Analytics Big Data Analytics High Throughput Computing High Throughput Computing
    • transtec Germany Tel +49 (0) 7071/703-400 transtec@transtec.de www.transtec.de transtec Switzerland Tel +41 (0) 44/818 47 00 transtec.ch@transtec.ch www.transtec.ch transtec United Kingdom Tel +44 (0) 1295/756 500 transtec.uk@transtec.co.uk www.transtec.co.uk ttec Netherlands Tel +31 (0) 24 34 34 210 ttec@ttec.nl www.ttec.nl transtec France Tel +33 (0) transtec.fr@transtec.fr www.transtec.fr Texts and concept: Layout and design: Dr. Oliver Tennert, Director Technology Management & HPC Solutions | Oliver.Tennert@transtec.de Stefanie Gauger, Graphics & Design | Stefanie.Gauger@transtec.de © transtec AG, June 2013 The graphics, diagrams and tables found herein are the intellectual property of transtec AG and may be reproduced or published only with its express permission. No responsibility will be assumed for inaccuracies or omissions. Other names or logos may be trademarks of their respective owners. Simulation CAD CAEEngineering Price Modelling Life Sciences Automotive Aerospace Risk Analysis Risk Analysis Big Data Analytics High Throughput Computing