DSMP
Addressing the Limitations of the Message Passing Interface with a Unique Distributed Shared Memory Application

By Peter Robinson
Symmetric Computing
Venture Development Center, University of Massachusetts - Boston, Boston, MA 02125

Overview

Today, the language-independent communications protocol Message Passing Interface (MPI) is the de facto standard for most supercomputers. However, problems solved on these clusters must be decimated to fit within the physical limitations of the individual nodes and modified to accommodate the cluster's hierarchy and messaging scheme. As of 1Q10, the quad-socket Symmetric Multiprocessing (SMP) nodes that make up these MPI clusters can support 24 x86 cores and 128GB of memory. However, addressing problems with big data-sets is still impractical under the MPI-1 model, which has no shared-memory concept, and MPI-2 has only a limited distributed shared memory (DSM) concept. Even an MPI-2 cluster based upon these state-of-the-art SMP processing nodes cannot support problems with very large data-sets without significant restructuring of the application and associated data-sets. Even after a successful port, many programs suffer poor performance due to MPI hierarchy and message latency.

The Symmetric Computing Distributed Symmetric Multiprocessing™ (DSMP™) Linux kernel enhancement enables Distributed Shared Memory (DSM), or a distributed global address space (DGAS), across an InfiniBand-connected cluster of Symmetric Multiprocessing (SMP) nodes with breakthrough price/performance. The Symmetric Computing Linux kernel enhancement transforms a homogeneous cluster into a DSM/DGAS supercomputer that can service very large data-sets, or accommodate legacy MPI applications with increased efficiency and throughput via application utilities that support MPI over shared memory. DSMP is poised to displace and obsolete message passing as the protocol of choice for a wide range of memory-intensive applications because of its ability to service a wider class of problems with greater efficiency.

DSMP comprises two unique operating systems: the host OS, which runs on the head-node, and a lightweight micro-kernel OS, which runs on all other servers that make up the cluster. The host OS consists of a Linux image plus a new DSMP kernel, creating a new derivative work. The micro-kernel is a non-Linux operating system that extends the function of the host OS over the entire cluster. These two OS images (host and micro-kernel) are designed to run on commodity Symmetric Multiprocessing (SMP) servers based on either the AMD or Intel direct-connect architecture, i.e., the AMD Opteron™ processor or the Intel Nehalem™ processor.

DSMP enables, at the kernel level, a shared-memory software architecture that scales to hundreds of thousands of cores based on commodity hardware and InfiniBand. The key features that enable this scalable DSM architecture are:

• A DSMP kernel-level enhancement that results in significantly lower latency and improved bandwidth, making a DSM/DGAS architecture both practical and possible;
• A transactional distributed shared-memory system, which allows the architecture to scale to thousands of nodes with little to no impact on global-memory access times;
• An intelligent, optimized InfiniBand driver which leverages the HCA's native Remote Direct Memory Access (RDMA) logic, reducing global memory-page access times to under 5 microseconds;
• An application-driven, memory-page coherency scheme that simplifies porting applications and allows the programmer to focus on optimizing performance rather than decimating the application to accommodate the limitations of the message-passing interface.

MPI Cluster vs. DSM Supercomputer vs. DSMP™ Cluster

As stated earlier, although MPI clusters are the de facto platform of choice, data-sets in bioinformatics, oil & gas, atmospheric modeling, etc. are becoming too large for the SMP nodes that make up the cluster, and in many cases it is impractical and inefficient to decimate the large data-sets. The alternatives are to proceed anyway with a full restructuring of the problem and suffer the inefficiencies, or to purchase time on a university or national-laboratory DSM supercomputer. The problem with the DSM supercomputer approach is the prohibitive cost of the hardware and the lengthy queue time incurred to access an NSF/DoE DSM supercomputer. Additionally, the system requirements of a researcher looking to model a physical system or assemble/align a DNA sequence are quite different from those of enterprise computing. In short, researchers do not need a hardened, enterprise-class, nine-9's-reliable platform with virtualization support, because their applications are primarily single-process, multiple-thread. In addition, they are more than willing to optimize their applications for the target hardware to get the most out of the run. Ultimately they want unencumbered 24/7 access to an affordable DSM supercomputer – just like their MPI cluster. The table below summarizes the differences between the three approaches.

                                            MPI Cluster   DSM Supercomputer   DSMP™ Cluster
Commodity hardware                          Yes           No                  Yes
Support for DSM                             Limited       Yes                 Yes
Intelligent Platform Management Interface   Yes           Yes                 Yes
Virtualization support                      No            Yes                 No
Static partition for multi-app. support     No            Yes                 Planned
Affordability factor                        $             $$$$$               $
Incrementally expandable                    Yes           No                  Yes
Support for >10K cores                      Yes           No                  Yes

Enter Symmetric Computing

The design team of Symmetric Computing came out of the research community. As such, they are very aware of the problems researchers face today and will face in the future. This awareness drove the development of DSMP™ and the decision to leverage commodity hardware to implement a DSM/DGAS supercomputer. Our intent is nothing short of having DSMP do for shared-memory supercomputing what the Beowulf project (MPI) did for cluster computing: enabling thousands of researchers and universities to solve massively complex problems on commercially available hardware – Supercomputing for the Masses.

How DSMP™ works

As stated in the introduction, DSMP™ is software that transforms an InfiniBand-connected cluster of homogeneous 1U/4P commodity SMP servers into a shared-memory supercomputer. Although there are two unique kernels (a host-kernel and a micro-kernel), for this discussion we will ignore the difference between them because, from the programmer's perspective, there is only one OS image and one kernel. The DSMP kernel provides the following enhancements that enable this transformation:

1. A transactional distributed shared-memory system;
2. Optimized InfiniBand drivers;
3. An application-driven, memory-page coherency scheme;
4. An enhanced multi-threading service;
5. Support for a distributed MuTeX;
6. A memory-based distributed disk-queue.

The transactional distributed shared-memory system: The centerpiece of DSMP is its transactional distributed shared-memory architecture, which is based on a two-tier memory-page scheme (local and global) and tables that support the transactional memory architecture. Just two years ago, such an approach would have seemed inefficient and a poor use of memory. However, today memory is the most abundant resource within the modern server, and leveraging this resource to implement a shared-memory architecture is not only practical but very cost effective. The transactional global shared memory is concurrently accessible and available to all the processors in the system. This memory is uniformly distributed over each server in the cluster and is indexed via tables which maintain a linear view of the global memory. Local memory contains copies of what is in global memory, acting as a memory-page cache that provides temporal locality for program and data. All processor code execution and data reads occur only from the local memory (cache).

[Figure: data-flow view of the Trio™ Departmental Supercomputer – three SMP nodes (SMP 0, SMP 1, SMP 2) connected by InfiniBand links IB1-IB3, each contributing 128GB to the global transactional shared memory (4,096-byte pages) and holding 16-32GB of local memory (cache, 64-byte cache lines), with GbE and a 1TB HDD per server.]

Shown above is a data-flow view of the Trio™ departmental supercomputer, which is based on three 1U/4P servers with 48 or 72 processor cores and up to 128 gigabytes of physical memory per node. The three nodes in Trio are connected via 40Gb InfiniBand and there is no switch. The size of the local memory is set at boot time but is typically between one (1) and two (2) GB per core or greater (application driven). When there is a page fault in local memory, the DSMP kernel finds an appropriate least-recently-used (LRU) 4K memory-page in local memory and swaps in the missing global-memory page; this happens in just under 5 microseconds. The large, temporally local memory (cache) provides all the performance benefits (STREAMS and Linpack) of local memory in a legacy SMP server, with the added benefit of spatial locality to a large globally shared memory which is less than 5 microseconds away. Not only is this architecture unique and extremely powerful, but it can scale to hundreds and even thousands of nodes with no appreciable loss in performance; so long as a globally shared memory-page remains less than 5 microseconds away, performance continues to scale.

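To make the local-cache/global-memory relationship concrete, the following is a minimal user-space simulation of the idea, not DSMP code: a handful of 4K local frames cache pages of a larger global store, with LRU eviction standing in for the kernel's page swap and a memcpy standing in for the sub-5-microsecond RDMA transfer. All names and sizes here are illustrative assumptions.

    /* User-space simulation of the two-tier page scheme described above:
     * a small "local memory" cache of 4K frames backed by a larger
     * "global memory". Names and sizes are illustrative only. */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define PAGE_SIZE    4096
    #define GLOBAL_PAGES 1024     /* simulated global transactional memory */
    #define LOCAL_FRAMES 8        /* simulated local memory (cache)        */

    static unsigned char global_mem[GLOBAL_PAGES][PAGE_SIZE]; /* stand-in for the cluster-wide store */

    typedef struct {
        int      global_page;     /* which global page this frame caches (-1 = empty) */
        uint64_t last_used;       /* LRU timestamp */
        unsigned char data[PAGE_SIZE];
    } frame_t;

    static frame_t  local[LOCAL_FRAMES];
    static uint64_t clock_tick;

    /* Return a pointer to the cached copy of global_page, faulting it in
     * (LRU eviction + copy, standing in for the RDMA transfer) if needed. */
    static unsigned char *fault_in(int global_page)
    {
        int victim = 0;
        for (int i = 0; i < LOCAL_FRAMES; i++) {
            if (local[i].global_page == global_page) {      /* hit: serve from cache */
                local[i].last_used = ++clock_tick;
                return local[i].data;
            }
            if (local[i].last_used < local[victim].last_used)
                victim = i;                                  /* track the LRU frame */
        }
        /* miss: write back the victim, then pull in the requested page */
        if (local[victim].global_page >= 0)
            memcpy(global_mem[local[victim].global_page], local[victim].data, PAGE_SIZE);
        memcpy(local[victim].data, global_mem[global_page], PAGE_SIZE);
        local[victim].global_page = global_page;
        local[victim].last_used = ++clock_tick;
        return local[victim].data;
    }

    int main(void)
    {
        for (int i = 0; i < LOCAL_FRAMES; i++) local[i].global_page = -1;
        strcpy((char *)global_mem[42], "hello from global page 42");
        printf("%s\n", (char *)fault_in(42));   /* miss: faulted in from global memory */
        printf("%s\n", (char *)fault_in(42));   /* hit: served from the local cache    */
        return 0;
    }
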
The Optimized InfiniBand Drivers: The entire success of DSMP™ revolves around the availability of a low-latency, commercial network fabric. It wasn't that long ago, with the exit of Intel from InfiniBand, that industry experts were forecasting its demise. Today InfiniBand is the fabric of choice for most High Performance Computing (HPC) clusters, due to its low latency and high bandwidth. To squeeze every last nanosecond of performance out of the fabric, the designers of DSMP bypassed the Linux InfiniBand protocol stack and wrote their own low-level drivers. In addition, they developed a set of drivers that leverage the native RDMA capabilities of the InfiniBand host channel adapter (HCA). This allows the HCA to service and move memory-page requests without processor intervention. Hence, RDMA eliminates the overhead of message construction and deconstruction, reducing system-wide latency.

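DSMP's low-level drivers are proprietary, but the one-sided transfer they rely on can be sketched with the standard libibverbs API. The fragment below assumes an already-connected reliable-connection queue pair (qp), a completion queue (cq), a registered local buffer (mr), and the peer's remote address and rkey exchanged out of band; it illustrates the RDMA-read mechanism, not DSMP's driver code.

    /* Sketch: fetch one 4K page from a remote node with a one-sided RDMA read,
     * using the standard libibverbs API (connection setup is assumed). */
    #include <infiniband/verbs.h>
    #include <stdint.h>
    #include <string.h>

    int rdma_read_page(struct ibv_qp *qp, struct ibv_cq *cq,
                       struct ibv_mr *mr, void *local_buf,
                       uint64_t remote_addr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)local_buf,   /* where the page lands locally */
            .length = 4096,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr, *bad_wr = NULL;
        memset(&wr, 0, sizeof(wr));
        wr.opcode              = IBV_WR_RDMA_READ;   /* one-sided: no remote CPU involvement */
        wr.sg_list             = &sge;
        wr.num_sge             = 1;
        wr.send_flags          = IBV_SEND_SIGNALED;
        wr.wr.rdma.remote_addr = remote_addr;        /* remote page's address */
        wr.wr.rdma.rkey        = rkey;               /* remote memory key     */

        if (ibv_post_send(qp, &wr, &bad_wr))
            return -1;

        /* Busy-poll the completion queue until the read finishes. */
        struct ibv_wc wc;
        int n;
        while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
            ;
        return (n < 0 || wc.status != IBV_WC_SUCCESS) ? -1 : 0;
    }
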
An application-driven, memory-page coherency scheme: As stated in the introduction, proprietary supercomputers maintain memory consistency and/or coherency via a hardware extension of the host processor's coherency scheme. DSMP, being a software solution based on a local and global memory resource, had to take a different approach. Coherency within the local memory of each individual SMP server is maintained by the AMD64 Memory Management Unit (MMU) on a cache-line basis. Global-memory page coherency and consistency is controlled and maintained under program control, shifting responsibility from hardware to the programmer or the application. This approach may seem counter-intuitive at first, but it is the most efficient way to implement system-wide coherency on commodity hardware. Again, the target market segment for DSMP is technical computing, not enterprise computing, and in most cases the end user is familiar with their algorithm and how to optimize it for the target platform – in the same way code was optimized for a Beowulf MPI cluster. The fact that most applications are open source, combined with the high skill level of the end users, drove system-level decisions that have kept DSMP-based clusters affordable, fast and scalable. In cases where the end user is not computer literate, or does not have access to a staff of computer scientists, Symmetric Computing can either provide access to open-source algorithms already optimized for DSMP or work with your team to modify the algorithm for you.

To assist the programmer in maintaining memory consistency, a set of DSMP-specific primitives was developed. These primitives, combined with some simple, intuitive programming rules, make porting an application to a multi-node DSMP platform simple and manageable. Those rules are as follows:

• Be sensitive to the fact that memory-pages are swapped into and out of local memory (cache) from global memory in 4K pages and that it takes less than 5 microseconds to complete the swap.
• Be careful not to overlap or allocate multiple data-sets within the same memory-page. To help prevent this, a new malloc( ) function, a_malloc( ), is provided to ensure alignment on a 4K boundary and to avoid false sharing.
• Because of the way local and global memory are partitioned within physical memory, care should be taken to distribute processes/threads and their associated data-sets evenly over the four processors while maintaining temporal locality to data. How you do this is a function of the target application. In short, care should be taken not only to distribute threads but also to ensure some level of data locality. DSMP™ supports full POSIX conformance to simplify parallelization of threads.
• If a data-structure is "modified-shared" and is accessed by multiple processes/threads on an adjacent server, then it will be necessary to use a set of new primitives to maintain memory consistency: Sync( ), Lock( ) and Release( ). These three primitives simplify the implementation of system-wide memory consistency (a usage sketch follows this list).
  - Sync( ) synchronizes a local data-structure with its parent in global memory.
  - Lock( ) prevents any other process/thread from accessing and subsequently modifying the noted data-structure. Lock( ) also invalidates all other copies of the data-structure (memory-pages) within the computing system. If a process/thread on an adjacent computing device accesses a memory-page associated with a locked data-structure, its execution is suspended until the structure (memory-page) is released.
  - Release( ) unlocks a previously locked data-structure.

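The sketch below illustrates how these rules and primitives might be applied in practice. This paper gives only the names a_malloc( ), Sync( ), Lock( ) and Release( ); the prototypes shown here are assumptions for illustration and will differ from the actual DSMP headers.

    /* Illustrative use of the DSMP consistency rules. The prototypes below
     * are assumptions; consult the DSMP headers for the real signatures. */
    #include <stddef.h>

    void *a_malloc(size_t size);            /* assumed: 4K-aligned allocation (avoids false sharing) */
    void  Lock(void *data, size_t size);    /* assumed: lock + invalidate remote copies              */
    void  Sync(void *data, size_t size);    /* assumed: synchronize local pages with global memory   */
    void  Release(void *data, size_t size); /* assumed: unlock the data-structure                    */

    #define N (1 << 20)

    int main(void)
    {
        /* Rule: keep each shared data-set on its own 4K-aligned pages. */
        double *shared = a_malloc(N * sizeof(double));

        /* A "modified-shared" structure touched by threads on other nodes:
         * lock it, update it, publish the update, then release it. */
        Lock(shared, N * sizeof(double));
        for (size_t i = 0; i < N; i++)
            shared[i] = (double)i;
        Sync(shared, N * sizeof(double));      /* push the update to the parent copy in global memory */
        Release(shared, N * sizeof(double));

        return 0;
    }
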
NOTE: If your application is single-threaded, or is already parallelized for OpenMP or Pthreads, then it will run on a DSMP™ system (such as Trio™) without modification. The only limitation is that a multi-threaded application may not be able to take advantage of all the processor cores on the adjacent worker nodes (only the processor cores on the head-node). Hence, in the case of Trio™ you will have access to all 16/24 processor cores on the head-node and up to 336GB of global shared memory with full coherency support. This ability to run your single-threaded or OpenMP/Pthreads applications, and take full advantage of DSMP™ transactional distributed shared memory without modifying your source, provides a systematic approach to full parallelization. It should be noted that the current implementation of DSMP does not support MPI applications. Sometime in 2H10, we plan to release a wrapper that will support MPI over the DSMP shared-memory architecture.

Multi-Threading: The "gold standard" for parallelizing C/C++ or Fortran source code is OpenMP and the POSIX thread library, or Pthreads. POSIX is an acronym for Portable Operating System Interface. The latest version, POSIX.1 (IEEE Std 1003.1, 2004 Edition), was developed by the Austin Common Standards Revision Group (CSRG). To ensure that Pthreads would work with DSMP, each of the two dozen or so POSIX routines was either tested with and/or modified for DSMP and the Trio™ platform, so that the POSIX.1 standard is supported in its entirety.

Distributed MuTeX: A MuTeX, or mutual exclusion, is a set of algorithms used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable or a critical section. A distributed MuTeX is nothing more than a DSMP kernel enhancement that ensures a MuTeX functions as expected within the DSMP multi-node system. From a programmer's point of view, there are no changes or modifications to the MuTeX – it just works.

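As an illustration of that transparency, the following is ordinary Pthreads code that protects a shared counter with a mutex. Per the description above, this is the kind of program that should run unmodified on a DSMP system; nothing in it is DSMP-specific, and it compiles on any Linux host with -pthread.

    /* Standard Pthreads mutual exclusion: per the text above, this kind of
     * code needs no changes to run on a DSMP system. */
    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 8
    #define NITERS   100000

    static long            counter;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < NITERS; i++) {
            pthread_mutex_lock(&lock);     /* the MuTeX the kernel distributes */
            counter++;                     /* critical section                 */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
        return 0;
    }
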
Memory-based distributed disk-queue: DSMP provides a high-bandwidth, low-latency elastic queue for data that is intended to be written to a low-bandwidth interface, such as a Hard Disk Drive (HDD) or the network. This distributed input/output queue is a memory (DRAM) based elastic storage buffer which effectively eliminates the bottlenecks that occur when multiple threads compete for a low-bandwidth resource.

A Perfect Storm

Every so often, there are advancements in technology that impact a broad swath of society, either directly or indirectly. These advancements emerge and take hold due to three events:
1. A brilliant idea is set upon and implemented which solves a real-world problem;
2. A set of technologies has evolved to a point that enables the idea; and
3. The market segment which directly benefits from the idea is ready to accept it.

The enablement of a high-performance shared-memory architecture over a cluster is such an idea. As implied in the previous section, the technologies that have allowed DSMP to be realized are:
• The adoption of Linux as the operating system of choice for technical computing;
• The commercial availability of a high-bandwidth, low-latency network fabric – InfiniBand;
• The adoption of the x86-64 processor as the architecture of choice for technical computing;
• The integration of the DRAM controller and I/O complex onto the x86-64 processor;
• The sudden and rapid increase in DRAM density with the corresponding drop in memory cost;
• The commercial availability of high-performance, small form-factor SMP nodes.

If any of the above six advancements had not existed, Distributed Symmetric Multiprocessing would not have succeeded.

DSMP Performance

Performance of a supercomputer is a function of two metrics: 1) processor performance (computational throughput), which can be assessed with the HPCC Linpack or FFT benchmarks, and 2) global memory read/write performance, which can be assessed with the HPCC STREAMS or RandomAccess benchmarks. The extraordinary thing about DSMP™ is the fact that it is based on commodity components. That's important because, as with MPI, DSMP performance scales with the performance of the commodity components on which it depends. As an example, random read/write latency for DSMP went down 40% when we moved from a 20Gb to a 40Gb fabric, and no changes to the DSMP software were needed. Also, within this same timeframe, the AMD64 processor density went from quad-core to six-core, again with only a small increase in the cost of the total system. Therefore, over time the performance gap between a DSMP™ cluster and a proprietary SMP supercomputer of equivalent processor/memory density will continue to narrow – and eventually disappear.

Looking back to the birth of the Beowulf project, an SMP supercomputer of equivalent processor/memory density outperformed a Beowulf cluster by a factor of 100, yet MPI clusters went on to dominate supercomputing – why? The reasons are twofold.

First – price/performance: MPI clusters are affordable, from both a hardware and a software (Linux-based) perspective, and they can grow in small ways, by adding a few additional PCs/servers, or in large ways, by adding entire racks of servers. Hence, if a researcher needs more computing power, they simply add more commodity PCs/servers.

Second – accessibility: MPI clusters are inexpensive enough that almost any university or college can afford them, making them readily accessible for a wide range of applications. As a result, an enormous academic resource is focused on improving MPI and the applications that run on it.

DSMP™ brings the same level of value, accessibility and grass-roots innovation to the same audience that embraced MPI. Moreover, the performance gap between a DSMP cluster and a legacy shared-memory supercomputer is small, and in many applications a DSMP cluster outperforms machines that cost ten times its price. As an example, if your application can fit (program and data) within the local memory (cache) of Trio, then you can apply 48 or 72 processor cores concurrently to solve your problem with no penalty for memory accesses (all memory is local for all 48 or 72 cores). Even if your algorithm does access global shared memory, it is only 5 microseconds away.

Today, Symmetric Computing is offering four turn-key DSMP proof-of-concept platforms, coined the Trio™ Departmental Supercomputer line, and one dual-node system, coined Duet™. The table below lists the various gigaflop/memory configurations available in 2010.

Trio™ P/N     Peak theoretical floating-point performance   Linpack score in local memory (75% of peak)   Total system memory   Total shared memory*
SCA241604-3   749 gigaflops                                 562 gigaflops                                 192 GB                120 GB
SCA241604-2   538 gigaflops                                 404 gigaflops                                 256 GB                208 GB
SCA241608-3   749 gigaflops                                 562 gigaflops                                 384 GB                312 GB
SCA161608-3   538 gigaflops                                 404 gigaflops                                 384 GB                336 GB

* Note: In this table, local memory (cache) is set to 1GB/core – a total of 16GB for quad-core and 24GB for six-core processors.

Looking forward to 2Q10, the Symmetric Computing engineering staff will introduce a multi-node, InfiniBand switch-based system delivering almost 2 teraflops of peak throughput with more than a terabyte of global memory in a 10-blade chassis. In addition, we are working with our partners to deliver turn-key platforms tuned for application-specific missions – such as HMMER, BLAST and de novo alignment/assembly algorithms.

Challenges and Opportunities

Symmetric Computing's most significant challenge is to convince users that there is real value in its DSMP Linux kernel enhancement over MPI. Much of the HPC market has become complacent with MPI and looks for incremental improvement, i.e., MPI-1 to MPI-2, which continues to be the path of least resistance. In addition, as MPI nodes adopt the direct-connect architectures of the AMD Opteron™ and Intel Nehalem™ processors, combined with memory densities of more than 256GB per node, MPI-2 with DSM support will gradually service a wider range of applications.

Symmetric Computing will address these issues by first supplying affordable, turn-key, DSMP-enabled hardware – such as Trio™. These platforms will be optimized to solve problems in narrow vertical markets, such as bioinformatics, at an unprecedented price/performance level. We will then enable the wider market with the availability of larger turn-key platforms, a version of OpenMP optimized for DSMP, and APIs that allow MPI applications to run on these turn-key shared-memory platforms with improved performance.

In the short term, Symmetric Computing will continue to focus on academic and research supercomputing. We will continue to release turn-key platforms focused on specific verticals where we have an advantage. Our long-term strategy is to continue to displace MPI with DSMP for applications where DSM/DGAS is the architecture of choice – delivering Supercomputing for the Masses.

About Symmetric Computing

Symmetric Computing is a Boston-based software company with offices at the Venture Development Center on the campus of the University of Massachusetts – Boston. We design software to accelerate the use and application of shared-memory computing systems for bioinformatics, oil & gas, post-production editing, financial analysis and related fields. Symmetric Computing is dedicated to delivering standards-based, customer-focused technical computing solutions for users ranging from universities to enterprises. For more information, visit www.symmetriccomputing.com.