Dsmp Whitepaper Release 3


Published on


  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide

Dsmp Whitepaper Release 3

  1. 1. D S M P Overview of the Distributed Symmetric Multiprocessing Software Architecture By Peter Robinson Technical Marketing Manager Symmetric Computing Venture Development Center University of Massachusetts - Boston Boston MA 02125 Page 1
  2. 2. This page intentionally left blank Page 2
  3. 3. Introduction Distributed Symmetric Multiprocessing or DSMP, is a new kernel extension or kernel enhancement, that extends the capabilities of the legacy Linux operating system, so it can support a scalable, shared-memory architecture over a 40Gb InfiniBand attached cluster. DSMP is comprised of two unique software components; the host operating system (OS) System Call Interface (SCI) which runs on the head-node and a unique lightweight micro-kernel Process Virtual File OS which runs on all “other” servers (which make-up the cluster). Management (PM) System (VFS) The host OS consists of a Linux image plus a new DSMP kernel, Memory Network creating a new durative work as noted in Figure 1. The micro-kernel Management (MM) Stack is a non-Linux based image that extends the function of the host OS DSMP Gasket Interface over the entire cluster. These two OS images (host and micro- kernel), are designed to run on commodity, Symmetric Device ARCH Drivers Multiprocessing (SMP) servers based on the AMD64 processor. (DD) The AMD64 architecture was selected over competing platforms for a number of reasons, the primary being Figure 1 – Host DSMP Software architecture price performance. Back in 2005 when we conceived DSMP, the AMD Opteron™ Processor was the only x86 solution that supported a high density, 4P direct connect architecture in a 1U form-factor. As of 4Q09, AMD continues to provide the best value for 4P 1U servers and they continue to offer the only commercially viable 4P solution on the market today. A look at supercomputing today Supercomputing can be divided into two camps - proprietary shared-memory systems or commodity message passing Interface (MPI) clusters. Shared memory systems are based on commodity processors such as the PowerPC or Itanium or the ever-popular x86 and commodity memory (DRAM SIMMs). At the core of most shared-memory systems is a proprietary fabric. This fabric physically extends the host processors coherency scheme over multiple nodes, providing low-latency inter-node communication while maintaining system wide coherency. These ultra expensive, hardened shared- memory supercomputers are designed to accommodate concurrent, enterprise or transactional processing applications. These applications; VMware, Oracle, dbase, SAP, etc. can utilize one to 512+ processor- cores and tera-bytes of shared-memory. Most of these applications are optimized for the host OS and the micro-architecture of the host processor, but not for the macro architecture of the target system. Shared- memory systems are also a great deal easer to develop applications for. In fact, rarely is it ever necessary to modify code-sets or data-sets to run on a shared-memory system, for most SMP software plugs and plays, which is why the shared-memory supercomputers are in such high demand. Page 3
  4. 4. Conversely, MPI clusters are comprised entirely of commodity servers, connected via Ethernet, InfiniBand, or similar communication fabrics. However, these commodity networks introduce tremendous latency compared to proprietary fabrics on OEM shared-memory supercomputers. Additionally, cluster computing poses challenges for application providers to comply with the strict rules of MPI and to work within the memory limitations of the SMP nodes which makeup the cluster. Despite the computational and porting overhead, the cost benefits of commodity based computing solutions make MPI clusters a staple of University and small-business research labs. Although MPI is the platform of choice for Universities and Research Labs, data-sets in bioinformatics, Oil & Gas, atmospheric modeling, etc. are becoming too large for single node Symmetric Multi-Processing (SMP) systems and are impractical for an MPI clusters, due to the problems that arise when you decimate data-sets. The alternative is to purchase time on a National Labs shared-memory supercomputer (such as the ORNL peta-scale Cray XT4/XT5 Jaguar supercomputer). The problem with the Jaguar supercomputer option is cost, time and overkill. In short, the reliability, availability serviceability (RAS) of enterprise computing is quite different from what a researcher wants; as an example researchers and academia: • Don’t need an hardened enterprise class 9-9s reliable platform; • Do not run multiple applications concurrently and there is no need for virtualization. • Applications are single-process, multiple-thread; • Have an aversion to spending time, dollars and staff-hours needed to apply to access these National Lab machines; • Do not want to wait weeks on end in a queue to run their application; • Are willing to optimized their applications for the target hardware to get the most out of the run; • Ultimately want unencumbered 24/7 access to an affordable shared-memory machine – just like their MPI cluster. Enter Symmetric Computing The design team of Symmetric Computing came out of the research community. As such, they were very aware of the problems researcher face today and in the future. This awareness drove the development of DSMP and the need to base it on commodity hardware. Our intent is nothing short of have DSMP do for shared-memory supercomputing what the Beowulf project (MPI) did for cluster computing. Page 4
  5. 5. How DSMP works As stated in the introduction, DSMP is software that transforms an InfiniBand connected cluster of homogeneous 1U/4P commodity servers into a shared-memory supercomputer. Although there are two unique kernels, (host-kernel and a micro-kernel), for this discussion, we will ignore the difference between them because, from the programmers perspective, there is only one OS image and one kernel. The DSMP kernel provides seven (7) enhancements that transform a cluster into a distributed symmetric multiprocessing platform, they are: 1. The shared-memory system; 2. The optimized InfiniBand driver which supports a shared-memory architecture; 3. An application driven, memory page coherency scheme; 4. An enhanced multi-threading service, based on the POSIX thread standard; 5. A distributed MuTeX; 6. A memory based distributed disk-queue and 7. A Distributed disk array. Treo™ Departmental The shared-memory system: The center piece of DSMP is its shared- Supercomputer memory architecture. For our example we will assume a three node 4P system with 64GB of physical memory per node. The three nodes are networked via 40Gb InfiniBand and there is no switch. This in fact is our value Treo™ Departmental Supercomputer product offering, shown here on the right. Figure 2 presents a macro view of the DSMP memory architecture. What become quite obvious from viewing this graphic is the application of two memory segments, i.e., local-memory and global-memory. 64GB 64GB 64GB G B 12GB 4 P0 P1 P2 P3 Global Memory “0” 16GB 16GB 16GB Local Local Local Memory Memory Global Memory “0” “3” Memory “1” “1” Global Memory “3” TX TX Figure 1 - DSMP memory architecture TX RX RX RX SMP 0 SMP 1 SMP n Page 5
  6. 6. Both coexist in the SMP physical memory and are evenly distributed over the four AMD64 processors on each of the three servers. However, the memory management unit (MMU) on the AMD Opteron™ processor sees only the local memory (as noted in blue). Local memory is statically allocated by the kernel, for our Treo™ example we will assume 1GB of local memory for every AMD64 core within the server. Hence, there are 16GB of local-memory per server or 48GB of local-memory allocated from the 192GB of available system wide memory. The remaining 144GB is global-memory, which is concurrently viewable and accessible by all 48 processor cores within the Treo™ Departmental Supercomputer. All memory (local and global) is partitioned into 4,096 byte pages or 64 AMD64 cache-lines. When there is a cache-line miss from local-memory (a page fault), the kernel identifies a least recently used (LRU) memory-page and swaps in the missing memory-page from global-memory. That happens, across the InfiniBand fabric, in just under 5µ-seconds, even faster if the page is on the same physical node. The Optimized InfiniBand Drivers: The entire success of DSMP revolved around the existence of a low latency, commercially available network fabric. It wasn’t that long ago, with the exit of Intel from InfiniBand, that the industry experts were forecasting its demise. Today InfiniBand is the fabric of choice for most High Performance Computing (HPC) clusters due to its low latency and high bandwidth. To squeeze every last nano-second of performance out of the fabric, the designer of DSMP bypassed the Linux InfiniBand protocol stack and wrote his own low-level driver. In addition, he developed a set of drivers that leveraged the native RDMA capabilities of the InfiniBand host channel adapter (HCA). This allowed the HCA to service and move memory-pages requests, without processor intervention. Hence, RDMA eliminates the overhead for message construction and deconstruction, reducing system-wide latency. An application driven, memory page coherency scheme: As stated in the introduction, all proprietary supercomputers maintain memory-consistency and/or coherency via hardware extension of the host processor. DSMP took a different approach for maintaining the two separated levels of coherency within the system. First there is cache-line coherency within the local SMP server. Coherency at this level is maintained by the MMU and the SMP logic native to the AMD64 processor, i.e., Cache-coherent HyperTransport™ Technology. However, global memory page coherency and consistency is controlled by, and maintained by the programmer. This approach may seem counter intuitive at first. However, the target market-segment for DSMP was technical computing not enterprise and it was assumed that the end user is familiar with the algorithm and how to optimize it for the target platform (in the same way code was optimized for a Beowulf cluster). Given the high skill level of the end users with the need to use only commodity hardware, drove system level code decisions to keep a DSMP cluster both affordable and fast. To obtain these goals, new and enhanced Linux primitives were developed. Hence, with some simple, intuitive programming rules, augmented with new primitives; porting an application to a DSMP platform (while maintaining coherency), is simple and manageable. Those rules are as follows: Page 6
  7. 7. • Be sensitive to the fact the memory-pages are swapped into and out of local memory from global memory in 4K pages and that it takes 5µ-seconds to complete the swap. • Be careful not to overlap or allocate multiple data sets within the same memory page. To help prevent this a new Alloc( ) primitive is provided to assure alignment. • Because of the way local and global memory are partitioned (within physical memory), care should be taken to distribute process/threads and associated data evenly over the four processors. In short, try not to pile-up process/threads on one processor/memory unit, but rather distribute them evenly over the system. POSIX thread primitives are provided to support the distribution of these threads. • If there is a data-set which is “modified-shared” and accessed by multiple process/threads which are on an adjacent server, then it will be necessary to use a set of new Linux primitives to maintain coherency i.e., Sync( ), Lock( ) and Release( ). Multi-Threading: The “gold standard” for parallelizing Linux C/C++ source code is with the POSIX thread library or Pthreads. POSIX is an acronym for Portable Operating System Interface. The latest version; POSIX.1 - IEEE Std 1003.1, 2004 Edition, was developed by The Austin Common Standards Revision Group (CSRG). To ensure that Pthreads would work with DSMP each of the two dozen or so POSIX routines were either tested to and/or modified for DSMP and the Treo™ platform. The common method for parallelizing a process is via the Fork( ) primitive. Within DSMP there is a flag associated with Fork( ). This flag determines if the forked thread is to say local (with the current process on the primary server), or run on one of the remote servers. This allows the programmer to specify, how many threads of a given process can be serviced by the head node. Simple analysis will show just how many thread can run concurrently before performance flattens out due to the memory-wall effect, or other conditions. Once this value is understood, the remote flag can be used to evenly distribute threads over all the servers within the DSMP system. By default, each successive instance of Fork( ) caused that thread to be associated with the next server in the DSMP system, in a round-robin fashion. Hence, a Fork ( ) remote of three threads on Treo™ would place the current process on each of the three servers with one thread per server. The Kernel manages the consistency of the process to ensure it executes with the same environment and associated state variables. Coherency at the memory-page level is the responsibility of the programmer. A lot of this is common sense; if a memory page is accessed by multiple threads and up-dated (modified – exclusive), then it will be necessary to hold off pending threads until the current thread has updated the page in question. To facilitate this, three DSMP Linux primitives are provided. They are Sync( ), Lock( ) and Release( ). Page 7
  8. 8. • Sync( ): as the name Implies, synchronize one (1) local private memory-page with its source global-memory page. • Lock( ): is used to prevent any other process thread from accessing and subsequently modifying the memory-page. Lock( ) also invalidates all other copies of the locked memory- page within the system. If a process thread on an adjacent server accesses a locked memory page, execution is suspended until the page is released. • Release( ): as the name implies, releases a previously locked memory page. Lastly, to insure that data structure do not overlap, a new DSMP Alloc( ) primitive is provided to force alignment for a give data-structure on a 4K boundary. This primitive assures that the end of one data-structure does not fall inside an adjacent data-structure. Distributed MuTeX: Wikipedia describes MuTeX or Mutual exclusion as a set of algorithms which are used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable or a critical sections. A distributed MuTeX is nothing more than a DSMP kernel enhancement which insures that MuTeX functions as expected within the DSMP system. From a programmers point-of-view, there are no changes or modification to MuTeX – it just works. Memory based distributed disk-queue: A new DSMP primitive D_file( ) provides a high- bandwidth/low-latency elastic queue for data which is intended to be written to a low bandwidth interface, such as a Hard Disk Drive (HDD) or the network. This distributed input/output queue, is a memory (DRAM) based storage buffer which effectively eliminates bottlenecks which occur when a multiple threads compete for a low bandwidth device such as a HDD. Once the current process retires, the contents of the queue are sent to the target I/O device and the queue is released. A Distributed disk array: A distributed disk array is implemented by the kernel through enhancements made to the Linux striped volume manager. These enhancements extend the Linux volume manager over the entire network interface providing to the OS, a single consolidated drive. On Treo™ the distributed disk array is made up of six (6) 1TB drivers – two per server, for a single 6TB storage device. DSMP Performance Performance of a supercomputer is a function of two metrics: 1) Processor performance (computational throughput); 2) Global Memory Read/Write performance - which can be furthered divided down to: a. Stream performance – continuous R/W memory bandwidth and b. Random read/write performance (memory R/W latency). The extraordinary thing about the DSMP™ is the fact that it is based on commodity components. That’s important, because DSMP performance scales with the performance of the commodity components from which it is made. As an example, random read/write latency for Treo™, went down 40% with the availability of 40Gb InfiniBand. Furthermore, this move from 20Gb to a 40Gb fabric caused no appreciable increase in the cost of a Treo™ system (and no changes to the DSMP software were needed). Page 8
  9. 9. Also, within this same timeframe, AMD64 processor density went from quad-core to six-core, again without any appreciable increase in the cost of the total system. Therefore, over time the performance gap between DSMP™ shared-memory supercomputers and proprietary shared-memory systems will close. Today proprietary shared-memory system providers have intra-node bandwidth numbers in the order of 2.5GB/sec. and random access times in the order of 1µsec. That’s a difference of ~4:1 in bandwidth and ~5:1 in R/W latency over DSMP™. At first glance, this much of a disparity might appear as a disadvantage, but that is not necessarily the case - for three reasons. First: DSMP random R/W latency is based on the time it takes to move 4,096B vs. 64B or 128B in <1µsec. (for SGI and others); that’s a 64:1 or 32:1 difference in size of the cache-line or page size. In addition, the processors used in these proprietary systems might have enhanced floating-point capabilities but they might run slower, in some case, much slower than a 2.8GHz quad-core AMD Opteron™ Processor. So performance is not tied entirely to memory latency or processor performance but is a function of many system variables as-well- as the algorithm and the way the data is structured. A second and more important reason why the DSMP performance is not a problem is access. That is, having open and unencumbered 24/7 access to a shared-memory system. As an example, let’s assume it takes 24 hours to run a job on the ORNL Jaguar supercomputer with a allocation of 48 processors and 150GB of shared-memory. However, it takes months to submit the proposal and gain approved. Then there’s the additional wait in the queue of around 14 days - to access the system; typical for this type of engagement. If we assume the DSMP™ shared-memory supercomputer is 1/5 the performance of the one at Oakridge (due to memory latency, bandwidth and related factors), then it would take five times longer to get the same results – that’s 120 hours verses, 24. However, when you take into account the two week queue time, the results are available 10-days sooner. In the same time-frame, you could have run the job three times over. The third and final reason is value. Today, an entry level Treo™ departmental supercomputer costs only $49,950 - configured with 144GB of shared memory, 48 - 2.8Ghz AMD64 processor cores and 6TB of disk storage (University pricing). A comparable shared-memory platform from an OEM would approach $1,000,000 (not including maintenance and licensing fees), that’s 1/20 of the price at 1/5 the performance. With the introduction of the Treo™ departmental supercomputer, Universities and researchers have a new option which is based on the same market forces that drove the emergence of the MPI cluster i.e., commodity hardware, value and availability. Today, Symmetric Computing is offering four unique configuration of Treo™ from 48 to 72 - AMD64 cores and 144GB to 336GB of shared- memory (see table on following page). Page 9
  10. 10. Treo™ Quad-core Six-core 4GB 8GB Total P/N 2.8GHz 2.6GHz PC5300 PC5300 Shared DIMMs DIMMs Memory SCA161604-3 269 Giga-flops - 192 GB - 144GB SCA241604-3 - 374 Giga-flops 192 GB - 120GB SCA241608-3 - 374 Giga-flops - 384 GB 312GB SCA161608-3 269 Giga-flops - - 384 GB 336GB Looking forward to 1Q10, the Symmetric Computing engineering staff will introduce a 10-node blade center delivering 1.2 Tera-flops of peak throughput with 640GB or 1.28GB of system memory. In addition, we are working with our partners to deliver turn-key platform tuned for application specific missions – such as next generation sequencing, HMMER, BLAST, etc. Conclusion Symmetric Computing’s overall goal is to make supercomputing accessible and affordable to a broad range of end users. We believe that DSMP is to shared-memory computing what Beowulf/MPI was to distributed-memory computing. We are focused on delivering an affordable, commodity based technical computing solutions that services an entirely new market with – the Departmental Supercomputer. Our initial focus is to provide open applications optimized to run under DSMP and on Treo™, to accelerate scientific developments in Biosciences and Bioinformatics. We continue to expand our scope of applications and remain committed to delivering Supercomputing to the Masses. About Symmetric Computing Symmetric Computing is a Boston based software company with offices at the Venter Development Center on the campus of the University of Massachusetts – Boston. We design software to accelerate the use and application of shared-memory computing systems for Bioinformatics, Oil & Gas, Post Production Editing, Financial analysis and related fields. Symmetric Computing is dedicated to delivering standards-based, customer-focused technical computing solutions for users, ranging from Universities to enterprises. For more information, visit www.symmetriccomputing.com. Page 10