{DESCRIPTION} This screen displays an image of a group of IBM System x eX5 servers. {TRANSCRIPT} Welcome to IBM System x™ Technical Principles - Introduction to Intelligent Clusters. This is Topic 7 of the System x Technical Principles course series - XTW01.
{DESCRIPTION} This slide presents a bullet list of the course objectives. {TRANSCRIPT} The objectives of this course are to: describe a high-performance computing cluster; list the business goals that Intelligent Clusters addresses; identify three core Intelligent Clusters components; list the high-speed networking options available in Intelligent Clusters; list three software tools used in clusters; and describe cluster benchmarking.
{DESCRIPTION} This slide presents a bullet list of the topics discussed. {TRANSCRIPT} In this topic we will explore: commodity clusters; an overview of Intelligent Clusters; cluster hardware; cluster networking; and cluster management, the software stack, and benchmarking. The next section discusses commodity clusters.
{DESCRIPTION} This slide presents a bullet list of Commodity Cluster features. There is an image of a server rack in the lower right corner. {TRANSCRIPT} What is a commodity cluster? Clusters have many definitions in the industry. The most common definition of a commodity cluster is a group of interconnected computer systems that operate as a single logical entity to create a large, powerful system capable of solving complex computational problems. Clusters consist of commodity components: server systems with up to two CPU sockets (up to eight total CPU cores) each, based on commodity x86 or POWER architecture; a high-speed interconnect network such as InfiniBand, with the associated switches, for inter-process communication; and a storage subsystem, either directly attached to the individual nodes or a network-attached system such as NAS or SAN. The servers in the cluster run a commodity operating system such as Linux or Microsoft Windows, along with a common cluster-wide hardware and software management system such as xCAT. In addition, clusters run software applications that solve various industry problems in science, engineering, finance, and other fields. As opposed to the traditional techniques of building powerful machines called supercomputers, clusters use commodity technologies to create systems that are more cost-effective yet scalable, equaling (and in some cases exceeding) the performance of traditional supercomputers. Hence, clusters based on commodity technologies today enable computing at the speeds of traditional supercomputers.
{DESCRIPTION} This slide presents an exploded view of a cluster's components. There is an image of each. {TRANSCRIPT} The picture in this chart depicts the conceptual architecture of a cluster system built out of commodity components (nodes, networking, switches, storage, etc.). The mid-portion of the picture shows a rack full of IBM server systems, which act as the core compute nodes of the cluster and are connected in the backend using two separate network fabrics – Ethernet and InfiniBand. The Ethernet network is used mainly as the fabric for management traffic (OS provisioning, remote hardware management, IPMI, Serial-over-LAN, running administrative commands on nodes, job scheduling, etc.). The InfiniBand network is used as the fabric for inter-process communication and message passing across the nodes, providing a high-bandwidth, low-latency interconnect. On the right side of the picture is the Fibre Channel Storage Area Network (SAN) storage, which is directly attached via the fibre-optic network and switches to the storage nodes. The storage nodes manage the SAN storage directly and provide the shared cluster file system, which is exported to the rest of the cluster (i.e., the compute nodes). Other kinds of storage systems are also commonly used in clusters, including network-attached storage (NAS) and direct-attached storage (DAS). As shown in the picture, the different network traffic types are carefully isolated by configuring corresponding VLANs on the network switches. For example, all the management traffic is isolated into a separate VLAN (designated as the Management VLAN), which for security reasons is only accessible to the main management node in the cluster. The Cluster VLAN, on the other hand, connects the management node to the compute nodes, and in some cases it is also used as the VLAN for message-passing traffic. User access to the cluster is provided via the "Login" nodes, which are designated as the main entry point for users into the cluster. Various user applications are installed on the login nodes, such as compilers, message-passing libraries (e.g., MPICH), debuggers, and job scheduler interface tools. Using login nodes provides a more secure and cleaner interface for users into the cluster. There are usually one or more management nodes in the cluster. The management node is where all the hardware management and other administrative tools necessary to deploy and manage the cluster are installed. Cluster administrators log in to the management node to perform hardware management as well as various software and user-related configuration activities on the cluster nodes.
{DESCRIPTION} This slide presents a graphic similar to a bar graph. Each bar represents an application type. {TRANSCRIPT} Commodity clusters are applied widely in the industry to solve complex computational problems with speed, accuracy, and efficiency. As shown in the picture, clusters are used across industry vertical segments including energy, finance, manufacturing, life sciences, media and entertainment, the public sector, and government. Several common applications of clusters in each of these industry segments are shown in the picture, including seismic analysis (e.g., oil exploration), portfolio risk analysis in finance, finite element analysis (FEA) and engineering design in the manufacturing industry, weather forecasting, oceanography, and so on. As is evident, clusters are not only used in traditional research and academic computing, but also in commercial industry segments today.
{DESCRIPTION} {TRANSCRIPT} Rapid advances in processor technologies and the accelerated development of new techniques for improving computing efficiency are making cluster computing an increasingly attractive and easily deployable solution for various industries. One of the key advances contributing to the success of HPC is multi-core processor technology. Multi-core processors enable denser computing and more parallelism within the individual nodes in the cluster, so that more computation can be done with fewer nodes. In addition, faster and larger memory chips enable applications requiring large amounts of physical RAM, as well as the ability to run multiple applications simultaneously without performance penalties. Virtualization is a hot topic in the industry today, although the concept has been around for a while. Virtualization enables consolidation of physical resources and provides various other advantages. The application of virtualization to HPC is not yet attractive, given the performance and scalability concerns for HPC applications. However, future advances in both hardware and software technologies might make virtualization interesting for a subset of HPC applications that do not have stringent performance requirements and could potentially benefit from the reliability and fault-tolerance aspects of virtualization. As clusters scale to hundreds and thousands of computers, the complexity of managing and efficiently utilizing expensive resources becomes a concern for system administrators. Power consumption and equipment cooling in data centers are two key concerns in today's large-scale computing environments. Several "green" technologies and strategies are emerging to make computing resources more power-efficient and easier to cool.
{DESCRIPTION} {TRANSCRIPT} The next section presents an overview of Intelligent Clusters.
{DESCRIPTION} {TRANSCRIPT} There are multiple approaches for deploying clusters. As shown in the picture, customers have several choices when it comes to purchasing and installing clusters. Roll Your Own (RYO): in this approach, the customer orders all the required individual hardware components, such as servers and switches, from vendors like IBM. The customer then integrates these components on their own, or contracts a third-party integrator. The disadvantage of this approach is that the customer does not get a full solution from a single vendor and has to deal with warranty-related issues with each hardware vendor directly – in other words, there is no single point of contact for support and warranty issues. IBM Racked and Stacked: in this approach, the client procures servers and storage components in standard racks from IBM and then integrates other third-party components, such as switches, into these racks on their own, or contracts IBM Global Services or another integrator to do this work. The disadvantage of this approach is that the client needs to address warranty and support issues with each vendor directly. BP Integrated: in this approach, a qualified IBM Business Partner works with the customer to order servers and storage components from IBM and networking components from third-party vendors. The Business Partner then builds the cluster by integrating the components and delivers the cluster to the customer. The disadvantage of this approach is, again, that the customer needs to address warranty and support issues directly with each vendor. Intelligent Clusters: in this approach, the customer orders a fully integrated cluster solution directly from IBM, which includes all the servers and storage as well as the third-party network switches. The advantages of this approach are: IBM delivers the factory-built and tested cluster solution, ready to be deployed in the customer's data center and easy to plug into their environment; and the customer can contact IBM for all warranty, service, and support issues – in other words, a single point of contact.
{DESCRIPTION} {TRANSCRIPT} IBM System x Intelligent Clusters is an "integrated cluster" solution offering from IBM. Intelligent Clusters offers customers a fully integrated, turn-key cluster solution by combining various IBM and OEM hardware components, IBM and third-party software components, and implementation and management services, and it provides a single point of support for all hardware and software. As shown in the picture, Intelligent Clusters consists of the following core components: IBM System x servers – rack-mount servers (the Intel-based x3550 M3 and x3650 M3), blade servers (the Intel-based HS22, HS22V, and HX5), and high-density servers (IBM System x iDataPlex with 2U and 3U FlexChassis technology and Intel-based dx360 M3 servers); IBM System Storage – Fibre Channel, SAS, and iSCSI based storage systems, switches, and adapters; and network switches – 1Gbps Ethernet and 10Gbps Ethernet. IBM Intelligent Clusters integrates all these core components into a single cluster-optimized solution and delivers it to customers as a fully bundled, ready-for-deployment solution. Intelligent Clusters also offers professional implementation services for custom deployments via the System x Lab-based Services organization.
{DESCRIPTION} {TRANSCRIPT} As shown in the picture, the IBM HPC cluster solution is created by combining IBM server hardware, third-party switches and storage, cluster software such as GPFS, xCAT, and Linux or Windows, and the applications and tools necessary to run the customer's own HPC codes. Hence, an HPC cluster solution consists of all the hardware and software components end-to-end, ready to execute high-performance parallel and cluster applications.
{DESCRIPTION} {TRANSCRIPT} In the following section we will discuss Cluster hardware.
{DESCRIPTION} Overview of server choices for Intelligent Clusters {TRANSCRIPT} This chart shows some of the key IBM System x server offerings used in Intelligent Clusters. On the left are the rack-optimized servers based on Intel processors – the 1U x3550 M3 and the 2U x3650 M3. On the right are the IBM blade server offerings, with chassis optimized for various businesses and a variety of blade servers with Intel processors. Together, these servers offer a wide range of capabilities – such as large memory, high performance, and scalability – to address the needs of specific industries and applications.
{DESCRIPTION} {TRANSCRIPT} What challenges does iDataPlex address? iDataPlex is an innovative solution from IBM designed to better address data center challenges around compute density – not just rack density, but fitting more servers into the client's data center within their limited floor space, power, and cooling infrastructure. These data centers need a solution that can be deployed quickly (scalability) and is easy to service and manage. iDataPlex is designed to address TCO (total cost of ownership) – not just acquisition costs but operational costs throughout the lifecycle of the deployment. Finally, iDataPlex is designed to be flexible, since every customer's workload requirements are unique. IBM System x iDataPlex is the newest set of System x server offerings, targeted at extremely large-scale server deployments such as data centers running Web 2.0 and HPC style workloads. Customers needing computational capacity at extreme scale can deploy iDataPlex, which has been optimized for such environments. iDataPlex uses a custom rack design that allows up to 84 standard dual-CPU servers to be placed in a single rack with the same footprint as a standard enterprise rack, which allows a maximum of only 42 servers. In addition, the iDataPlex rack design has been optimized for power and cooling, so that the iDataPlex rack is more power-efficient and easier to cool than traditional racks. iDataPlex supports standard network switches and adapters for Ethernet and InfiniBand. A special chassis design called the FlexChassis allows various iDataPlex configurations to be created by combining servers, storage, and I/O options to address specific customer requirements. iDataPlex is one of the server choices offered for an Intelligent Clusters solution.
{DESCRIPTION} {TRANSCRIPT} The iDataPlex portfolio continues to evolve to meet the computing requirements of the data center of today and tomorrow. IBM introduced the dx360 M2 in March 2009, based on Intel Nehalem processors, which provides maximum performance while maintaining outstanding performance per watt with the highly efficient iDataPlex design. In March 2010, IBM introduced the dx360 M3, increasing performance and efficiency with the new Intel Westmere processors and new server capabilities, which we will go into in more detail in the next few charts. There is also a 3U chassis available with the dx360 M3 server, which provides up to twelve 3.5" SAS or SATA hard disk drives – up to 24TB per server – for large-capacity local storage. Again, within the iDataPlex rack we can mix these offerings to provide the specific rack-level solution the client is looking for.
{DESCRIPTION} {TRANSCRIPT} This chart shows various iDataPlex chassis and server configuration options. The compute-intensive configuration allows two 1U compute servers in the 2U chassis for maximum server density. The compute + storage configuration allows a combination of one server tray and a 1U drive tray; this combination allows up to five 3.5" form factor drives to be installed in the 2U chassis in addition to the compute server. The compute + I/O (GPU acceleration) configuration is the new addition to iDataPlex. This configuration is intended for customers that require GPU acceleration for certain high-performance applications that are intensive in floating-point and vector calculations and hence benefit from GPUs such as NVIDIA's Tesla cards, which are qualified for iDataPlex. This configuration allows two GPU cards to be installed in the chassis attached to a single 1U compute server; a special GPU I/O tray is required for this configuration. The 3U storage iDataPlex chassis is intended for storage-rich applications requiring vast amounts of storage. The 3U chassis allows up to twelve 3.5" form factor SAS/SATA drives of varying capacities to be installed. As of this presentation, the largest disks available are the 2TB SATA NL drives, which enable up to 24TB of disk space in a 3U iDataPlex chassis. In addition to the chassis configuration options, iDataPlex supports various power supply options, including the traditional 900W supply, the new 550W supply, and the 750W redundant power supply. Customers can choose the power supply option that suits their environment and requirements.
{DESCRIPTION} {TRANSCRIPT} What is new with the dx360 M3? The dx360 M3 provides more performance and better efficiency. The new Intel Westmere-EP processors provide up to 50% more cores with the 6-core design, and new lower-power CPUs are available to reduce power. The dx360 M3 allows for lower-power DDR3 memory, letting customers further increase efficiency without affecting performance. Where cost is a concern, clients can take advantage of dx360 M3 support for 2 DIMMs per channel at the full 1333MHz bandwidth with 95W processors and 12 DIMMs. This allows the server to maintain maximum performance by utilizing 12 lower-capacity DIMMs instead of 6 higher-capacity DIMMs, reducing acquisition cost. The dx360 M3 also provides an additional power supply option, which provides power redundancy for server and line-feed protection. With the dx360 M3 we have also brought in new capacities of hard drives – 2.5" and 3.5", SAS, SATA, and SSD. For example, the new 2TB 3.5" drives provide 24TB of local storage in the 3U chassis. We have also introduced new Converged Network Adapters with the M3, allowing convergence of Ethernet and Fibre Channel at the server on a single interface. Finally, the Trusted Platform Module is standard on the dx360 M3, providing secure key storage for applications such as Microsoft BitLocker.
{DESCRIPTION} {TRANSCRIPT} With the redundant power option, customers can still take advantage of all the optimization for software-resilient workloads, and can now also take advantage of iDataPlex efficiency for non-grid applications where they desire. The new supply is in the same form factor as the 900W non-redundant supply, with two discrete supplies inside the container that are bussed together and two discrete line feeds to split power to separate PDUs. Deploying a full rack of redundant power requires doubling the PDU count, but the vertical slots in the iDataPlex rack can easily accommodate this. Whether the customer's requirement is line-feed maintenance, node protection, or just increased reliability, iDataPlex can now deliver a solution.
{DESCRIPTION} {TRANSCRIPT} The iDataPlex GPU solution takes advantage of iDataPlex efficiency and thermal design to maximize density for GPU compute clusters. GPU cards have a peak power of 225W each, but the iDataPlex server can easily accommodate them within the current design. iDataPlex gives customers a much more efficient solution that allows more GPU-based servers to be deployed per rack within the customer's power and cooling envelopes, resulting in more GPUs per rack, fewer racks required, fewer power feeds, and ultimately lower operating cost. And, with the Rear Door Heat Exchanger, iDataPlex provides the ultimate solution for GPU computing!
{DESCRIPTION} {TRANSCRIPT} This chart depicts the new I/O capabilities of the dx360 M3. A new I/O tray and 3-slot riser are being introduced for the iDataPlex chassis, allowing two full-height, full-length, 1.5-wide cards (such as the NVIDIA M2050 "Fermi") in the top of the chassis with x16 connectivity. In addition, there is an open x8 slot designed to accommodate a high-bandwidth adapter such as an InfiniBand, 10Gb Ethernet, or Converged Network Adapter. The dx360 M3 also has an internal slot that accommodates a RAID adapter, providing full 6Gbps performance for up to four 2.5" drives. Compared to outboard solutions, each iDataPlex GPU server is individually serviceable: clients no longer have to take down two servers in the event of a problem with a GPU card, and sparing of GPUs becomes much simpler, as each card can be replaced individually instead of replacing an outboard unit containing four cards. The significant I/O capabilities also provide for maximum local storage performance with RAID. And GPUs are provided as part of the Intelligent Clusters integrated solution from IBM, so when there is an issue there is one number to call for resolution.
{DESCRIPTION} {TRANSCRIPT} Storage is an integral piece of every cluster solution. Typical cluster applications use storage for various purposes, including storing large amounts of application data, temporary (scratch) files, cluster databases, cluster OS images, the parallel file system (GPFS), and so on. Hence, having an efficient storage subsystem attached to a cluster is important. There are literally hundreds of storage solutions available in the market today, ranging from simple and cheap all the way to complex and expensive. The most commonly deployed storage solutions in clusters are disk storage, such as direct-attached storage (DAS) or network-attached storage (NAS), and Storage Area Network (SAN) storage. The complexity and cost of a particular storage solution depend on various factors, including the type of storage, vendor, protocol support (TCP/IP, iSCSI, Fibre Channel), life-cycle management features, management software, performance characteristics, etc. IBM System Storage offers several choices when it comes to storage solutions; the IBM storage portfolio spans entry-level, mid-range, and enterprise storage solutions. The Intelligent Clusters storage portfolio is restricted to entry-level and mid-range disk systems and SAN storage, which gives customers the option to use inexpensive direct-attached storage such as SAS storage (DS3000) or a more complex Fibre Channel SAN storage for higher performance (e.g., DS5100/DS5300). In addition, the Intelligent Clusters portfolio includes various third-party storage solutions such as switches and adapters (e.g., Brocade, QLogic, Emulex, and LSI).
{DESCRIPTION} {TRANSCRIPT} The picture shows some of the IBM disk storage systems supported in the Intelligent Clusters portfolio. As described in the previous charts, the Intelligent Clusters portfolio supports entry-level disk systems such as the DS3200 (SAS/SATA drives) with the EXP3000 storage expansion unit, the DS3300 (iSCSI interface with SAS disks), the DS3400 (Fibre Channel interface with SAS and SATA disks), and the DS3500 (SAS/SATA drives), as well as mid-range disk systems such as the DS5020 and DS5100/DS5300 with Fibre Channel interfaces.
{DESCRIPTION} {TRANSCRIPT} Cluster Networking will be examined in the following section.
{DESCRIPTION} {TRANSCRIPT} Clusters require one or more network fabrics for inter-node communication and management. Typically, clusters use at least one network for management and one network for inter-node communication (the compute network). Optionally, there can be additional networks, based on customer- and application-specific requirements for performance, security, fault tolerance, etc. The management network is used for managing various cluster elements, including servers, switches, and storage. A separate, dedicated management network is essential to reliably manage the cluster elements using either in-band or out-of-band communication. In addition, the management fabric is used for deploying OS images to cluster nodes or network-booting the OS on the cluster nodes using tools such as xCAT. The management network is also used for monitoring the cluster and gathering performance and utilization information. Gigabit Ethernet is typically used as the management network fabric. A compute network is used for inter-node communication and for message-passing applications to send and receive messages across the cluster nodes. The compute network is a dedicated network and typically carries only message-passing traffic, to avoid introducing extra overhead and congestion. Often, a high-speed network such as InfiniBand or Myrinet is used for the compute network to provide a high-bandwidth, low-latency fabric. In cases where bandwidth and latency are not a major concern, a Gigabit Ethernet or 10 Gigabit Ethernet (10GbE) network may also be employed for the compute network. The user or campus network is the external network to which a cluster is attached, so that users can log in to the cluster to run their jobs. The user network is not part of the cluster, but administrators need to provide a secure and reliable interface to the cluster from the user/campus network. In addition to the management, compute, and user fabrics, other networks can optionally be used in clusters, such as a storage network, which is used to interconnect the servers facing the storage subsystem (referred to as storage nodes).
{DESCRIPTION} This picture shows the Intelligent Clusters InfiniBand portfolio. {TRANSCRIPT} This picture shows the Intelligent Clusters InfiniBand portfolio.
{DESCRIPTION} {TRANSCRIPT} The Intelligent Clusters portfolio offers a wide range of options for networking. 1350 partners with several OEM vendors for network switches, adapters, and cables. As shown in the picture, the 1350 Ethernet switch portfolio consists of 1 Gigabit Ethernet switches from vendors such as SMC Networks, Cisco, Blade Network Technologies, and Force10. Customers have a range of choices when it comes to the Ethernet networking used in Intelligent Clusters solutions. The 1350 pre-sales cluster architect can pick and choose which particular switches to use in the cluster solution, depending on customer preferences and application needs. Special care must be taken to address the performance, scalability, and availability requirements of applications and users when architecting the cluster network fabric.
{DESCRIPTION} {TRANSCRIPT} This chart shows the Ethernet entry/leaf/top of rack switches qualified for the iDataPlex solution.
{DESCRIPTION} {TRANSCRIPT} When discussing cluster networks, one often comes across the concepts of centralized vs. distributed networks. Clusters typically use one of these two architectures for their network fabrics. In the case of a centralized network topology, there are one or more centralized switches, and all cluster elements – including servers, storage, and others – are connected directly to the central switches. There are no intermediate hops to reach the central switches, and all elements communicate with one another via the central switches. With a distributed network topology, on the other hand, the network architecture has multiple tiers. Typically, there are two tiers – the core/aggregation tier and the access/leaf tier. The core/aggregation tier consists of core switches that connect to the access/leaf tier switches via inter-switch links (ISLs). The leaf switches are smaller switches placed inside the individual racks in a cluster, and they connect directly to the cluster nodes. All nodes communicate via the leaf switches, which are aggregated at the core/aggregation point via the core switches. The right approach for networking – centralized or distributed – depends on various factors, including cluster size, performance requirements, and cost. Typically, a distributed network topology is used when the cluster is large – e.g., hundreds or thousands of nodes – because the distributed network scales well and is easy to expand in the future. The centralized network, on the other hand, is a good choice for small clusters of tens of nodes. As shown in the picture, the Intelligent Clusters portfolio provides several core switch choices with high port counts, from vendors such as Force10 and Cisco. In addition to the switches, the 1350 bill of materials also contains several high-speed network adapter choices, such as Chelsio 10GbE PCI-E adapters and blade daughter cards.
{DESCRIPTION} {TRANSCRIPT} High-speed networking is important for cluster applications where bandwidth and latency are critical for performance. A traditional Gigabit Ethernet network will not deliver on these requirements, due to its relatively low bandwidth and higher latencies. Hence, HPC clusters typically employ some type of high-bandwidth, low-latency network fabric to meet the performance requirements of applications. Today, the primary choices for high-speed networking for clusters are InfiniBand and 10 Gigabit Ethernet. InfiniBand is an industry-standard low-latency, high-bandwidth server interconnect, ideal for carrying multiple traffic types (clustering, communications, storage, management) over a single connection. It is a switch-based serial I/O interconnect architecture operating at a base speed of 2.5 Gb/s, 5 Gb/s, or 10 Gb/s in each direction (per port). It provides the highest node-to-node bandwidth available today – 40 Gb/s bidirectional with Quad Data Rate (QDR) technology – and the lowest end-to-end messaging latency, on the order of microseconds (1.2-1.5 µsec). It has wide industry adoption and multiple vendors (Mellanox, Voltaire, QLogic, etc.).
{DESCRIPTION} {TRANSCRIPT} InfiniBand (IB) is an industry-standard server interconnect technology developed by a consortium of companies as part of the InfiniBand Trade Association (IBTA). InfiniBand defines the standard for a low-latency, high-bandwidth, point-to-point server interconnect. Low latency – on the order of 1.2 microseconds at the application level – can be achieved using the Remote Direct Memory Access (RDMA) protocol for communication across servers, which bypasses the standard kernel protocol layers in the operating system and gives direct access to memory on the remote system. The high bandwidth of the InfiniBand fabric – on the order of 40 Gb/s with QDR technology – is achieved via the serial bus interface, with each lane supporting up to 10 Gbps of bidirectional bandwidth. The InfiniBand specification defines various speeds for the fabric, depending on the purpose of the link. For example, the most commonly used link width is 4x, which corresponds to four "lanes" of the IB serial links; this link width is used for connectivity between servers and switches. Links of 12x width, on the other hand, are typically used as inter-switch links. As IB technology has advanced over the years, the serial link speed has kept doubling every two to three years; correspondingly, the technology has been termed SDR (single data rate), DDR (double data rate), QDR (quad data rate), and so on. Currently, the QDR link speed is 10 Gbps bidirectional per lane; hence, a 4x QDR link gives 40 Gbps of bidirectional bandwidth. High-performance computing applications using a parallel middleware library such as the Message Passing Interface (MPI) typically use the native RDMA protocol enabled by InfiniBand fabric adapters and switches in order to achieve low-latency, high-bandwidth communication across processes running on different servers in the cluster. Multiple vendors make adapters, switching and associated gear, and software for InfiniBand. The Intelligent Clusters portfolio carries several vendor options to support IB as an integrated high-speed interconnect option for clusters.
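As a quick arithmetic check of the link-width figures quoted above (using the 10 Gbps per-lane QDR rate from this transcript), the aggregate link bandwidth is simply the per-lane rate multiplied by the link width:

\[
\text{4x QDR: } 4 \times 10~\mathrm{Gbps} = 40~\mathrm{Gbps}, \qquad
\text{12x QDR: } 12 \times 10~\mathrm{Gbps} = 120~\mathrm{Gbps}
\]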
{DESCRIPTION} {TRANSCRIPT} This chart shows the various InfiniBand switch and adapter options from Voltaire, QLogic, and Mellanox, which are supported in the Intelligent Clusters bill of materials. The newly introduced QLogic QDR core switches support up to 864 ports in a single chassis and are typically used in large-scale IB clusters with hundreds of nodes. The QDR leaf switches support up to 36 non-blocking IB ports. Dual-port QDR IB adapters are available from Mellanox and QLogic; these adapters are available for both rack-mount servers and blades.
{DESCRIPTION} {TRANSCRIPT} 10GbE, or 10GigE, is the IEEE 802.3ae Ethernet standard, which defines Ethernet technology with a data rate of 10 Gbits/sec. 10GbE technology enables applications to take advantage of interconnect speeds ten times (10x) faster than traditional 1 Gigabit Ethernet. The main advantage of 10GbE technology is that it requires no changes to application code originally written for 1 Gigabit Ethernet (provided the underlying OS and hardware support the 10GbE fabric). 10GbE is picking up momentum as the high-speed interconnect choice for "loosely coupled" message-passing HPC applications that traditionally used 1GbE as the interconnect. 10GbE technology is less expensive and easier to deploy than other high-speed networking options such as InfiniBand or Myrinet. There is wide industry support for 10GbE technology in terms of adapters and switches, with growing user adoption. 10GbE is fast becoming the choice for Data Center Ethernet (DCE) and the emerging Fibre Channel over Ethernet (FCoE) technologies with the Converged Enhanced Ethernet (CEE) standard. Intelligent Clusters supports 10GbE technologies at both the node level and the switch level, providing multiple vendor choices for adapters and switches (BNT, SMC, Force10, Brocade, Cisco, Chelsio, etc.).
{DESCRIPTION} {TRANSCRIPT} Cluster management, the software stack, and benchmarking will be discussed next.
{DESCRIPTION} {TRANSCRIPT} xCAT stands for Extreme Cluster (Cloud) Administration Toolkit. xCAT is an open source Linux/AIX/Windows scale-out cluster management solution, primarily developed and tested by IBM. xCAT's key design principles are to: build upon existing technologies; leverage best practices for provisioning and managing large-scale clusters and cloud-type infrastructure; implement everything as scripts, without any compiled code, to keep it portable; and make the source code available. xCAT's core capabilities are: remote hardware control (power on/off/reset, vitals, inventory, event logs, and SNMP alert processing); remote console management (serial console, SOL, and logging, plus video console without logging); remote boot control (local/SAN boot, network boot, and iSCSI boot); and remote automated unattended network installation (auto-discovery of nodes through intelligent switch integration, MAC address collection, service processor programming, remote BIOS/firmware flashing, and Kickstart, AutoYaST, imaging, stateless/diskless, and iSCSI installation). With all these features, xCAT provides a comprehensive, flexible, yet powerful cluster management solution that has been developed and tested on some of the biggest IBM clusters to date.
{DESCRIPTION} {TRANSCRIPT} IBM GPFS stands for General Parallel File System. GPFS is a cluster file system developed by IBM, originally targeted at high-performance computing environments to eliminate some of the core performance and scalability limitations faced by customers using traditional file systems such as NFS. GPFS provides significant performance and scalability advantages over traditional file systems and other cluster file systems in the market, due to its architecture and its evolution over the years. Today, GPFS is the premium choice when designing storage for clusters as well as for emerging cloud computing environments. Some of the important features of GPFS are: it provides fast and reliable access to a common set of file data from a single computer to hundreds of systems; it brings together multiple systems to create a truly scalable cloud storage infrastructure; GPFS-managed storage improves disk utilization and reduces footprint, energy consumption, and management effort; it removes client-server and SAN file system access bottlenecks; and all applications and users share all disks, with dynamic re-provisioning capability. GPFS is developed and sold as a commercially licensed product by IBM.
{DESCRIPTION} {TRANSCRIPT} GPFS provides shared storage to the cluster nodes and a common cluster-wide parallel file system. The parallelism in GPFS comes from its ability to provide concurrent shared access to the same files from multiple nodes in the cluster, which improves file access performance significantly over traditional techniques. GPFS is available on a wide range of platforms and operating systems, including IBM pSeries and xSeries servers, and AIX, Linux, and Windows. GPFS is currently used as the parallel file system on some of the largest supercomputers in the world, consisting of hundreds of nodes. GPFS has been demonstrated to scale beyond 2400 nodes without any performance degradation or loss of data. GPFS provides a single administrative control point, and most GPFS commands can be executed from any node in the cluster, which simplifies administration and provides flexibility. Shared disk: all data and metadata on disk is accessible from any node through a unique and consistent "disk I/O" interface. Parallel access: data and metadata are accessible from all nodes in the cluster at any time and in parallel, to improve performance.
{DESCRIPTION} {TRANSCRIPT} A cluster resource manager manages the nodes and other hardware resources in the cluster. The resource manager helps streamline resource requests from users by reserving resources and executing jobs on the cluster nodes. A resource manager is usually combined with a job scheduler, which interfaces with the resource manager to allocate resources to user jobs based on the job requirements. The job scheduler makes complex decisions when picking the next job to run from the queue, based on various job attributes such as priorities, fairshare policies, the type of resources requested, resource availability, reservations, etc. There are several cluster resource managers and job schedulers available, both in the public domain (open source) and commercially. Torque is an open source, portable resource manager and batch job scheduler. Although the base scheduler only comes with a few standard scheduling algorithms such as FIFO, Torque works in conjunction with an advanced job scheduler such as Maui (also open source) or Moab (the commercial version of Maui). When used in conjunction with such a scheduler, Torque acts as the resource manager, controlling the cluster resources and executing and managing jobs on the nodes. Other job schedulers and resource managers popular in cluster environments are Load Sharing Facility (from Platform Computing), Sun Grid Engine (from Sun Microsystems), Condor (from the University of Wisconsin), Moab Cluster Suite (from Cluster Resources), and LoadLeveler (from IBM).
{DESCRIPTION} {TRANSCRIPT} Message-passing libraries are used as the programming API for developing applications that run on clusters. Typically, HPC applications are written using the Message Passing Interface (MPI) or Parallel Virtual Machine (PVM) libraries, which provide an abstraction layer – a virtual model of the cluster – for writing parallel code that runs across multiple nodes. MPI is a portable parallel programming interface specification developed by a consortium of academic, government, and commercial organizations to enable parallel and cluster applications that can be easily ported across multiple hardware and operating system platforms. Various open source and commercial implementations of the MPI library are available, including MPICH2, LAM, Scali, and Open MPI. Many of these implementations support multiple networks as the underlying communication fabric, including Ethernet, InfiniBand, and Myrinet. Code written using MPI is portable across different networks, often requiring no changes to the source code (although the application might need to be recompiled and linked against the right network support library). Parallel Virtual Machine (PVM) is an open source library that provides a virtual view of the cluster, so that programmers can write code against this virtual "single system" model and do not have to be concerned with a particular cluster architecture.
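To make the message-passing model described above concrete, here is a minimal C sketch (the value sent and the two-rank exchange are purely illustrative); it assumes an MPI implementation such as MPICH2 or Open MPI is installed, that the program is built with the mpicc wrapper compiler, and that it is launched with mpirun across at least two processes:

/* minimal MPI send/receive sketch */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, value;

    MPI_Init(&argc, &argv);                   /* initialize the MPI runtime      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* this process's rank             */
    MPI_Comm_size(MPI_COMM_WORLD, &size);     /* total number of MPI processes   */

    if (rank == 0 && size > 1) {
        value = 42;                           /* arbitrary example payload       */
        /* send one integer from rank 0 to rank 1 over whatever fabric MPI uses */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}

The same source runs unchanged over Ethernet or InfiniBand; only the MPI library's underlying transport (and possibly the link step) changes, which is the portability point made in the narration.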
{DESCRIPTION} {TRANSCRIPT} Compilers are critical for creating optimized binary code that takes full advantage of specific processor architectural features – the CPU and memory architecture, execution units, pipelining, co-processors, registers, shared memory, etc. – so that the application can exploit the full power of the system and run most efficiently on the specific hardware platform. Typically, the vendors of the respective processors, such as Intel Xeon, AMD Opteron, IBM POWER, and Sun SPARC, have the best compilers for their processors: the Intel Compiler Suite, the AMD Open64 compilers, and the IBM XL C/C++ and Fortran compilers. Compilers are important for producing the best code for HPC applications, as individual node performance is a critical factor in overall cluster performance. Optimizing code for the specific processor used in the cluster nodes ensures optimal performance on individual systems, which in turn helps overall application performance when running on multiple systems in the cluster. In addition to the vendor-specific compilers, open source as well as other commercial compilers are available and commonly used for compiling HPC applications: for example, the GNU GCC compiler suite (C/C++, Fortran 77/90), which is part of the standard Linux distributions, and the PathScale compiler suite, which is currently owned by QLogic and sold commercially. Other support libraries and debugging tools are also commonly packaged and made available with the compilers, such as math libraries (e.g., the Intel Math Kernel Library and the AMD Core Math Library), and debuggers such as GDB (the GNU debugger) and the TotalView debugger from TotalView Technologies, which is used for debugging parallel applications on clusters.
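As a small, hypothetical illustration of how such tuned math libraries are consumed from application code, the following C sketch calls the standard BLAS matrix-multiply routine (DGEMM) through the CBLAS interface; it assumes a library providing cblas.h, such as OpenBLAS, ATLAS, or Intel MKL, is installed and linked (for example with -lopenblas), and the matrix size is arbitrary:

/* multiply two n x n matrices with a vendor-optimized BLAS kernel */
#include <stdio.h>
#include <stdlib.h>
#include <cblas.h>

int main(void)
{
    const int n = 512;                          /* illustrative matrix size */
    double *A = calloc((size_t)n * n, sizeof(double));
    double *B = calloc((size_t)n * n, sizeof(double));
    double *C = calloc((size_t)n * n, sizeof(double));

    for (int i = 0; i < n; i++) {               /* A = B = identity matrix  */
        A[i * n + i] = 1.0;
        B[i * n + i] = 1.0;
    }

    /* C = 1.0 * A * B + 0.0 * C, computed by the tuned library routine */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);

    printf("C[0][0] = %.1f (expected 1.0)\n", C[0]);
    free(A); free(B); free(C);
    return 0;
}

Swapping in a different BLAS implementation (for example, linking against MKL instead of OpenBLAS) requires no source changes, which is why optimized math libraries are a low-effort way to exploit the processor features discussed above.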
{DESCRIPTION} {TRANSCRIPT} This table summarizes various software tools, compilers, and libraries available for use on clusters.
{DESCRIPTION} {TRANSCRIPT} This table summarizes various software tools, compilers, and libraries available for use on clusters. As is evident from the table, a vast array of software tools is available for developing cluster applications.
{DESCRIPTION} {TRANSCRIPT} Benchmarking is the technique of running well-known reference applications on a cluster in order to exercise various system components and measure the performance characteristics of the cluster (e.g., network bandwidth, latency, FLOPS, etc.). Benchmarking allows cluster users and administrators to measure the performance and scalability of clusters and to address critical bottlenecks by isolating bad hardware and tuning applications to take optimal advantage of the hardware. Several public domain and commercial benchmarking tools are available for clusters. STREAM is a micro-benchmark used to measure memory throughput on individual cluster nodes; STREAM is useful for finding "skew" in the cluster by exposing nodes whose memory performance is inferior to the expected values. Linpack is an open source cluster benchmark application used to measure the sustained aggregate floating-point operations per second (FLOPS) across all the cluster nodes by solving a dense system of linear equations in parallel on the cluster. Linpack uses double-precision floating-point arithmetic and the Basic Linear Algebra Subprograms (BLAS) library for solving the linear equations; hence, Linpack is a good exerciser of the CPUs, memory, and the network subsystem of the cluster. Linpack results are used as the basis for determining the fastest supercomputers in the world, a list maintained on the top500.org website. Other commonly used cluster benchmarks include the HPC Challenge benchmark, the SPEC suite of benchmarks from the Standard Performance Evaluation Corporation (commercial), the NAS Parallel Benchmarks developed by NASA, the Intel MPI Benchmarks (IMB), etc. One key recommendation for cluster users is to benchmark clusters with their own codes and applications, because ultimately the users' codes and applications are what run on the cluster on a daily basis, and any performance tuning or improvement needed in these codes is best judged by running them on the cluster and then addressing the potential bottlenecks to improve application performance.
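As a rough, self-contained sketch of the kind of measurement STREAM performs (this is not the official STREAM benchmark; the array size, single "triad" loop, and timing method are illustrative assumptions), the following C program estimates per-node memory bandwidth:

/* STREAM-like triad sketch; compile e.g. with: gcc -O2 -std=gnu99 triad.c */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N 20000000L    /* ~20M doubles per array, large enough to exceed CPU caches */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double scalar = 3.0;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }   /* initialize inputs */

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];          /* triad: two reads + one write per element */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double bytes = 3.0 * N * sizeof(double);  /* total bytes moved by the triad loop */
    printf("Triad bandwidth: %.2f GB/s\n", bytes / secs / 1e9);

    free(a); free(b); free(c);
    return 0;
}

Running such a test on every node and comparing the reported bandwidth is one simple way to spot the "skewed" nodes mentioned in the narration before moving on to cluster-wide benchmarks such as Linpack.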
{DESCRIPTION} {TRANSCRIPT} In this course we presented the following topics: A cluster system is created out of commodity server hardware, high-speed networking, storage, and software technologies. High-performance computing (HPC) takes advantage of cluster systems to solve complex problems in various industries that require significant compute capacity and fast compute resources. IBM Intelligent Clusters provides a one-stop shop for creating and deploying HPC solutions using IBM servers and third-party networking, storage, and software technologies. InfiniBand, Myrinet (MX and Myri-10G), and 10 Gigabit Ethernet are the technologies most commonly used as the high-speed interconnect for clusters. The IBM GPFS parallel file system provides a highly scalable and robust parallel file system and storage virtualization solution for clusters and other general-purpose computing systems. xCAT is an open source, scalable cluster deployment and cloud hardware management solution. Cluster benchmarking enables performance analysis, debugging, and tuning for extracting optimal performance from clusters by isolating and fixing critical bottlenecks. Message-passing middleware enables the development of HPC applications for clusters. Several commercial software tools are available for cluster computing.
{DESCRIPTION} {TRANSCRIPT} This slide presents a glossary of acronyms and terms used in this topic.
{DESCRIPTION} {TRANSCRIPT} To learn more about Intelligent Clusters, please visit any of the resources presented in the slide.
{DESCRIPTION} {TRANSCRIPT} The following are trademarks of the International Business Machines Corporation in the United States, other countries, or both.