This document provides an overview of the Distributed Symmetric Multiprocessing (DSMP) software architecture. DSMP transforms an InfiniBand connected cluster of commodity servers into a shared memory supercomputer through two unique software components: 1) a host operating system that runs on the head node, and 2) a lightweight microkernel that runs on the other servers. Key aspects of DSMP include a shared memory system, optimized InfiniBand drivers, an application-driven memory page coherency scheme, enhanced multithreading support, and distributed disk storage. DSMP allows commodity clusters to provide shared memory capabilities at a lower cost than proprietary supercomputers.
This document discusses parallel computer memory architectures, including shared memory, distributed memory, and hybrid architectures. Shared memory architectures allow all processors to access a global address space and include uniform memory access (UMA) and non-uniform memory access (NUMA). Distributed memory architectures require a communication network since each processor has its own local memory without a global address space. Hybrid architectures combine shared and distributed memory by networking multiple shared memory multiprocessors.
This document discusses parallel computer memory architectures, including shared memory, distributed memory, and hybrid architectures. Shared memory architectures allow all processors to access a global address space, but lack scalability. Distributed memory assigns separate memory to each processor requiring explicit communication between tasks. Hybrid architectures combine shared memory within nodes and distributed memory between nodes for scalability.
Controlling Memory Footprint at All Layers: Linux Kernel, Applications, Libra... (peknap)
Reducing memory usage is well covered in the history of this conference, yet new tricks still exist. When optimizing the memory footprint of a home gateway device, the author found some unexpected places where small changes can save a valuable amount of DRAM or flash space. This talk visits several areas. Kernel: the fragmentation threshold, the page frame reclamation task, and atomic memory. Application level: memory-inefficient shared libraries due to ABI compliance and dynamic loading. Toolchain: tuning malloc allocator parameters and compiler options. System level: a general kernel may be more memory efficient than MMU-less uClinux, and preventing lockup when the system is on the brink of running out of memory.
Virtual memory is a memory management capability of an OS that uses hardware and software to allow a computer to compensate for physical memory shortages by temporarily transferring data from random access memory (RAM) to disk storage.
Process' Virtual Address Space in GNU/Linux (Varun Mahajan)
The document discusses the virtual address space of a process in GNU/Linux. It explains that a process has both a user space and kernel space in virtual memory. The process' virtual address space contains text, data, and shared library segments. Functions like brk, sbrk, mmap, malloc, and free are used to allocate and free memory in the data segment to grow the process heap.
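The summary above mentions mmap as one of the calls a process uses to obtain memory. As a minimal illustration (not taken from the presentation itself), Python's `mmap` module can create an anonymous mapping, the same primitive large allocations typically use instead of growing the heap via brk:

```python
import mmap

# Map one anonymous region of four pages; -1 means not file-backed.
PAGE = mmap.PAGESIZE
region = mmap.mmap(-1, 4 * PAGE)

region[:5] = b"hello"      # write into the mapped pages
print(region[:5])          # b'hello'
region.close()             # unmapping returns the pages to the OS
```

The mapping lives outside the data segment, which is why allocators route small requests through the heap and large ones through mmap.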
This document discusses massively parallel architectures and processing in memory (PIM) as ways to overcome the memory wall problem. It describes several PIM and cellular architectures including Cyclops, Gilgamesh, Shamrock, picoChip and DIMES. DIMES is an FPGA implementation of a simplified cellular architecture that was used by Jason McGuiness to test programming approaches. The talk concludes with an invitation for questions.
This document discusses and compares memory management systems in Windows and Linux. It covers topics like memory mapping, paging, protection, sharing memory between processes, and memory allocation strategies in both operating systems. It also analyzes the differences in how each OS distributes memory in the address space of processes.
International Journal of Engineering Research and Development (IJERD Editor)
This document summarizes the design of a virtual extended memory symmetric multiprocessor (SMP) organization using LC-3 processors. It discusses the LC-3 processor architecture and instruction set. It then describes the design of a dual core LC-3 processor that shares memory over 32K bank sizes. The key components of the LC-3 processor pipeline including fetch, decode, execute, and writeback units are defined along with their inputs, outputs, and functions. Memory architectures for SMP systems including conventional, direct connect, and shared bus approaches are also summarized.
Memory management involves binding instructions and data to memory spaces using logical and physical addresses. The CPU uses base and limit registers to map the logical address space to the physical address space. Logical addresses are converted to physical addresses by adding the base register value. If a logical address is larger than the limit, an error occurs. Swapping and paging are techniques to manage memory fragmentation. Page tables implement paging by mapping logical page numbers to physical page frames. Task Manager displays memory usage and the working set of processes. NVRAM support and PFN locking help optimize memory usage. NUMA architectures scale multiprocessing by grouping CPUs and memory into nodes to reduce access latency.
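The base/limit translation described above can be sketched in a few lines (a simulation of the hardware behavior, not any particular OS's code):

```python
def translate(logical_addr, base, limit):
    """Relocate a logical address using base/limit registers.
    Addresses at or beyond the limit trap, as described above."""
    if logical_addr >= limit:
        raise MemoryError("address beyond limit register: trap to OS")
    return base + logical_addr

# A process loaded at physical 4000 with a 1024-byte logical space:
print(translate(100, base=4000, limit=1024))   # 4100
```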
NUMA (Non-Uniform Memory Access) is a computer memory design that allows for multiprocessor systems where the memory access time depends on the location of the memory relative to the processor. With NUMA, accessing some regions of memory will take longer than others. The document discusses the background of NUMA, how it impacts operating system policies and programming approaches, and provides performance comparisons between UMA (Uniform Memory Access) and NUMA architectures.
This document is a research report on the virtual memory management of Linux. It discusses several key aspects of Linux's virtual memory system:
1) Linux uses a page replacement algorithm based on a clock algorithm, which cycles through physical memory pages checking for recent access using hardware-supported reference bits. This approximates an LRU replacement strategy.
2) Techniques like demand paging, copy-on-write, and memory mapping are used to improve efficiency. Only accessed pages are loaded into memory.
3) The report focuses on page replacement algorithms and swapping/caching technology, and identifies some problems with virtual memory management.
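The clock algorithm mentioned in point 1 can be simulated directly; this sketch models the hardware reference bits as a Python list (an illustration of the general algorithm, not Linux's actual implementation):

```python
def clock_replace(frames, referenced, hand):
    """One sweep of the clock (second-chance) algorithm: clear
    reference bits until an unreferenced page is found, and
    return that frame index as the victim."""
    n = len(frames)
    while True:
        if referenced[hand]:
            referenced[hand] = False      # give the page a second chance
            hand = (hand + 1) % n
        else:
            return hand                   # victim frame

frames = ["A", "B", "C", "D"]
refbits = [True, True, False, True]
victim = clock_replace(frames, refbits, hand=0)
print(frames[victim])   # C: the first page found with its bit clear
```

Because recently touched pages get their bits re-set by the hardware between sweeps, the hand tends to land on cold pages, approximating LRU as the report notes.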
Membase is a distributed database that provides simple, fast, and elastic key-value storage. It allows applications to scale out across commodity servers through consistent hashing and automatic rebalancing. A Membase cluster can be set up in five minutes or less with just a single node, and new nodes can be added elastically with no downtime. Membase uses the Memcached protocol and is compatible with thousands of existing applications.
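The consistent hashing mentioned above can be sketched as a hash ring with virtual nodes (a generic illustration of the technique, not Membase's actual vBucket implementation):

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring: keys map to the next node
    point clockwise, so adding a node moves only nearby keys."""
    def __init__(self, nodes, replicas=100):
        self.replicas = replicas
        self.ring = []                      # sorted (hash, node) points
        for node in nodes:
            self.add(node)

    def _hash(self, key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node):
        for i in range(self.replicas):      # virtual nodes smooth the load
            bisect.insort(self.ring, (self._hash(f"{node}:{i}"), node))

    def lookup(self, key):
        h = self._hash(key)
        i = bisect.bisect(self.ring, (h, ""))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.lookup("user:42"))   # stable choice among the three nodes
```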
The document discusses Windows memory and cache manager internals. It covers several topics:
- The virtual memory manager (VMM) abstracts physical memory to make it feel infinite to applications. It protects OS memory and enables memory sharing between applications.
- Paging is used to divide physical memory into equal size pages. Address translation uses page directories (PDEs) and page tables (PTEs) along with a translation lookaside buffer (TLB) for faster lookups.
- The cache manager improves performance by caching frequently used disk blocks in physical memory. It facilitates read-ahead and write-back caching to reduce disk access.
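The two-level PDE/PTE translation in the paging bullet above can be modeled with dictionaries standing in for the directory and tables (illustrative bit widths from the classic 32-bit x86 layout; the TLB is omitted):

```python
PAGE_BITS = 12            # 4 KiB pages
PT_BITS = 10              # 1024 entries per table

def walk(page_dir, vaddr):
    """Translate a 32-bit virtual address via page directory
    and page table, mirroring the PDE/PTE scheme above."""
    pde_i = vaddr >> (PAGE_BITS + PT_BITS)
    pte_i = (vaddr >> PAGE_BITS) & ((1 << PT_BITS) - 1)
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    page_table = page_dir[pde_i]          # PDE points at a page table
    frame = page_table[pte_i]             # PTE holds the physical frame
    return (frame << PAGE_BITS) | offset

# One directory entry whose table maps virtual page 5 to frame 0x2A:
page_dir = {0: {5: 0x2A}}
print(hex(walk(page_dir, (5 << 12) | 0x123)))   # 0x2a123
```

The TLB mentioned above caches exactly these walk results so most translations skip both table lookups.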
In this presentation, you will learn the fundamentals of multiprocessors and multicomputers in only a few minutes.
It covers the meanings, features, attributes, applications, and examples of multiprocessors and multicomputers.
This document describes PowerAlluxio, an in-memory file system that improves on Alluxio by enabling shared memory utilization across cluster nodes while maintaining memory locality. PowerAlluxio allows client nodes to utilize remote node memory if local memory is full, improving cluster memory usage without sacrificing performance. It also introduces a new Smart LRU eviction policy that reduces elapsed time by 24.76% for large datasets. Experiments showed PowerAlluxio achieved up to 14.11x faster task completion times compared to Alluxio when data could be fully cached.
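The summary above does not detail the Smart LRU heuristics, but the plain LRU baseline it refines is easy to sketch with an ordered dictionary:

```python
from collections import OrderedDict

class LRUCache:
    """Plain LRU eviction: the least recently touched entry
    is dropped when capacity is exceeded."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        self.data.move_to_end(key)          # touch: most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used

cache = LRUCache(2)
cache.put("a", 1); cache.put("b", 2)
cache.get("a")                  # "a" is now most recently used
cache.put("c", 3)               # evicts "b"
print(list(cache.data))         # ['a', 'c']
```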
The document discusses various aspects of computer memory systems including main memory, cache memory, and memory mapping techniques. It provides details on:
1) Main memory stores program and data during execution and consists of addressable memory cells. Memory access time is the time for a memory operation while cycle time is the minimum delay between operations.
2) Memory units include RAM, ROM, PROM, EPROM, EEPROM and flash memory which have different characteristics like volatility and ability to be written.
3) Cache memory uses fast SRAM to improve performance by taking advantage of locality of reference, where nearby memory accesses are common. Mapping techniques like direct, associative, and set-associative mapping determine how main memory blocks are placed into cache lines.
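The address split behind direct mapping can be shown concretely (bit widths here are illustrative, not from the document):

```python
def split_address(addr, block_bits=4, index_bits=8):
    """Split a physical address into tag / index / block offset,
    the decomposition a direct-mapped cache uses for lookup."""
    offset = addr & ((1 << block_bits) - 1)
    index = (addr >> block_bits) & ((1 << index_bits) - 1)
    tag = addr >> (block_bits + index_bits)
    return tag, index, offset

# Two addresses 4 KiB apart map to the same line (same index) with
# different tags; this conflict is what set-associative mapping relieves.
print(split_address(0x1234))   # (1, 35, 4)
print(split_address(0x2234))   # (2, 35, 4)
```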
Virtual Memory In Contemporary Microprocessors And 64-Bit Microprocessors Arc... (Anurag Deb)
Virtual memory is a technique that allows processes to have more memory than is physically available by swapping parts of programs between RAM and disk as needed. This summary examines the virtual memory designs of six contemporary microprocessors, including their memory management units, address spaces, and segmentation. It also introduces 64-bit microprocessor architectures, which can theoretically address much larger memory but are currently limited by physical memory and page table sizes. Overall, the document provides an overview of virtual memory implementations and differences across modern processor architectures.
This document discusses multiple processor systems including shared-memory multiprocessors, message-passing multicomputers, and wide area distributed systems. It describes different multiprocessor architectures like UMA and NUMA and challenges like heat dissipation. It also covers topics like multiprocessing operating systems, synchronization, scheduling, and communication in multicomputer systems.
The document discusses NUMA (Non-Uniform Memory Access), a computer architecture where memory access time depends on the memory location relative to the processor. Under NUMA, a processor can access its own local memory faster than non-local memory belonging to another processor. The NUMA architecture was designed to surpass the scalability limits of Symmetric Multi-Processing (SMP) architectures by limiting the number of CPUs connected to each memory bus. Microsoft SQL Server 2005 is aware of NUMA configurations and performs well on NUMA hardware without special configuration.
This document discusses implementing Non-Uniform Memory Access (NUMA) systems. It provides background on NUMA, describing how in NUMA architectures each processor has local memory that it can access directly, while still being able to access other processors' memory through interconnects. It discusses shared memory, cache coherence challenges, and strategies for managing coherence like directory-based approaches. It includes a memory architecture block diagram, circuit diagram, memory chip algorithm and example of the memory architecture in operation.
This document discusses multiple processor systems including multiprocessors, multicomputers, and distributed systems. It covers topics such as multiprocessor hardware architectures, operating systems, scheduling, synchronization, and communication in these systems. It also discusses distributed system middleware including document-based systems like the web, file system-based systems like AFS, shared object systems like CORBA and Globe, and coordination-based systems like Linda and Jini.
The document discusses Open Virtual Platforms (OVP) software for creating virtual platforms of processors and peripherals. It provides an overview of OVP's capabilities such as instruction accurate simulation, connecting to debuggers, and efficient system modeling. Examples are given demonstrating single and multicore platforms using PowerPC processors along with an example integrating OVP models into SystemC TLM 2.0.
This document discusses Linux memory management. It outlines the buddy system, zone allocation, and slab allocator used by Linux to manage physical memory. It describes how pages are allocated and initialized at boot using the memory map. The slab allocator is used to optimize allocation of kernel objects and is implemented as caches of fixed-size slabs and objects. Per-CPU allocation improves performance by reducing locking and cache invalidations.
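The slab idea described above, recycling fixed-size objects from preallocated slabs instead of hitting the page allocator for each request, can be sketched as a toy free list (an illustration of the concept, not the kernel's kmem_cache API):

```python
class SlabCache:
    """Toy slab-style cache: objects of one fixed size are carved
    from slabs and recycled through a free list."""
    def __init__(self, objsize, objs_per_slab=8):
        self.objsize = objsize
        self.objs_per_slab = objs_per_slab
        self.free = []                      # free objects across slabs
        self.slabs = 0

    def alloc(self):
        if not self.free:                   # grow: carve a new slab
            self.slabs += 1
            self.free = [bytearray(self.objsize)
                         for _ in range(self.objs_per_slab)]
        return self.free.pop()

    def release(self, obj):
        self.free.append(obj)               # object returns; slab is kept

cache = SlabCache(objsize=64)
a = cache.alloc(); cache.release(a)
b = cache.alloc()                # reuses the released object
print(a is b, cache.slabs)       # True 1
```

Reuse keeps already-initialized, cache-warm objects in circulation, which is the performance point the summary makes.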
The document discusses NUMA (Non-Uniform Memory Access) architecture and optimization. With NUMA, memory is divided across multiple nodes and latency depends on memory location. Local memory has the lowest latency while remote memory has higher latency. The document provides examples of local and remote memory access and discusses how process-parallel and shared-memory threading applications are affected by NUMA. It also covers NUMA-aware operating system differences, techniques for process affinity, and NUMA optimization strategies like minimizing remote memory access.
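The local-versus-remote latency gap above is the whole motivation for NUMA-aware placement; a tiny simulation with a hypothetical latency table (the numbers are invented for illustration) shows the policy a first-touch or local allocator approximates:

```python
# Hypothetical latencies (ns) from each CPU node to each memory node.
latency = {
    (0, 0): 80,  (0, 1): 140,   # node 0: local fast, remote slow
    (1, 0): 140, (1, 1): 80,
}

def preferred_node(cpu_node, mem_nodes):
    """Pick the memory node with the lowest latency from this CPU
    node, i.e. minimize remote memory access."""
    return min(mem_nodes, key=lambda m: latency[(cpu_node, m)])

print(preferred_node(0, [0, 1]))   # 0: local memory wins
print(preferred_node(1, [0, 1]))   # 1
```

Process affinity techniques mentioned in the summary keep a task on one node so this preference stays valid across its lifetime.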
Virtual memory allows treating main memory as a cache for pages on disk. It implements the separation of user logical memory from physical memory, allowing logical address spaces to be much larger than physical memory. Virtual memory can be implemented via demand paging or demand segmentation and gives advantages like higher multiprogramming and allowing very large logical address spaces by paging pages between disk and memory as needed.
This document describes Symmetric Computing's Distributed Symmetric Multiprocessing (DSMP) technology, which transforms an InfiniBand-connected cluster of commodity servers into a distributed shared memory supercomputer. DSMP addresses limitations of Message Passing Interface (MPI) by enabling a global address space across cluster nodes. It features a transactional distributed shared memory system, optimized InfiniBand drivers, and an application-driven memory page coherency scheme. DSMP aims to make shared memory supercomputing affordable and accessible for researchers through leveraging commodity hardware.
This document discusses virtual memory and cache memory. It defines virtual memory as a technique that allows programs to behave as if they have contiguous memory even if the actual physical memory is fragmented. It also describes how virtual memory provides each process with its own address space and hides fragmentation. The document also defines cache memory as a small, fast memory located close to the CPU that stores frequently accessed instructions and data to improve performance. It describes levels 1 and 2 caches and how they work with memory and disk caches.
Open Storage: a new way of "THINKING" about storage (guest8b632d)
This document discusses open storage as a new way of thinking about storage. It highlights how open storage uses open architecture, open source software, and standard hardware to provide interoperability and integrated platforms for computing, networking, and storage. Open storage offers breakthrough economics by avoiding proprietary costs and allowing anyone to develop storage solutions.
This document proposes a software-defined approach called SDPM (Software-Defined Persistent Memory) to abstract the heterogeneity of emerging persistent memory technologies and enable their use across different hardware configurations. It describes SDPM's design goals of supporting various local and remote persistent memory attach points while providing a unified programming model. The proposed architecture introduces a persistent memory manager and a file system to manage data placement and provide memory-like and storage-like access. An evaluation shows the prototype delivering near-optimal performance for local and remote persistent memory configurations.
This is an old presentation salvaged from archive.
http://tree.celinuxforum.org/CelfPubWiki/JapanTechnicalJamboree13
This is the English-translated version. The Japanese version is here:
DSMP
Overview of the Distributed Symmetric Multiprocessing Software Architecture

By Peter Robinson
Technical Marketing Manager
Symmetric Computing
Venture Development Center
University of Massachusetts - Boston
Boston MA 02125
Page 1
Introduction
Distributed Symmetric Multiprocessing, or DSMP, is a new kernel extension that enhances the legacy Linux operating system so it can support a scalable, shared-memory architecture over a 40Gb InfiniBand-attached cluster. DSMP is comprised of two unique software components: the host operating system (OS), which runs on the head node, and a unique lightweight micro-kernel OS, which runs on all "other" servers that make up the cluster. The host OS consists of a Linux image plus a new DSMP kernel, creating a new derivative work as noted in Figure 1. The micro-kernel is a non-Linux based image that extends the function of the host OS over the entire cluster. These two OS images (host and micro-kernel) are designed to run on commodity Symmetric Multiprocessing (SMP) servers based on the AMD64 processor.

[Figure 1 – Host DSMP software architecture. Layers shown: System Call Interface (SCI); Process Management (PM); Virtual File System (VFS); Memory Management (MM); Network Stack; DSMP Gasket Interface; Device Drivers (DD); ARCH.]

The AMD64 architecture was selected over competing platforms for a number of reasons, the primary one being price/performance. Back in 2005, when we conceived DSMP, the AMD Opteron™ Processor was the only x86 solution that supported a high-density, 4P direct-connect architecture in a 1U form factor. As of 4Q09, AMD continues to provide the best value for 4P 1U servers, and it continues to offer the only commercially viable 4P solution on the market today.
A look at supercomputing today
Supercomputing can be divided into two camps: proprietary shared-memory systems and commodity Message Passing Interface (MPI) clusters. Shared-memory systems are based on commodity processors, such as the PowerPC, the Itanium, or the ever-popular x86, and commodity memory (DRAM SIMMs). At the core of most shared-memory systems is a proprietary fabric. This fabric physically extends the host processor's coherency scheme over multiple nodes, providing low-latency inter-node communication while maintaining system-wide coherency. These ultra-expensive, hardened shared-memory supercomputers are designed to accommodate concurrent, enterprise, or transactional processing applications. These applications (VMware, Oracle, dBase, SAP, etc.) can utilize one to 512+ processor cores and terabytes of shared memory. Most of these applications are optimized for the host OS and the micro-architecture of the host processor, but not for the macro-architecture of the target system. Shared-memory systems are also a great deal easier to develop applications for. In fact, it is rarely necessary to modify code-sets or data-sets to run on a shared-memory system; most SMP software simply plugs and plays, which is why shared-memory supercomputers are in such high demand.
Conversely, MPI clusters are comprised entirely of commodity servers connected via Ethernet, InfiniBand, or similar communication fabrics. However, these commodity networks introduce tremendous latency compared to the proprietary fabrics of OEM shared-memory supercomputers. Additionally, cluster computing poses challenges for application providers, who must comply with the strict rules of MPI and work within the memory limitations of the SMP nodes that make up the cluster. Despite the computational and porting overhead, the cost benefits of commodity-based computing solutions make MPI clusters a staple of university and small-business research labs.
Although MPI is the platform of choice for universities and research labs, data-sets in bioinformatics, oil & gas, atmospheric modeling, etc. are becoming too large for single-node Symmetric Multi-Processing (SMP) systems, and they are impractical for MPI clusters due to the problems that arise when data-sets are partitioned. The alternative is to purchase time on a National Lab shared-memory supercomputer (such as the ORNL peta-scale Cray XT4/XT5 Jaguar supercomputer). The problem with the Jaguar option is cost, time, and overkill. In short, the reliability, availability, and serviceability (RAS) of enterprise computing is quite different from what a researcher wants. For example, researchers and academics:
• Don't need a hardened, enterprise-class, 9-9s-reliable platform;
• Do not run multiple applications concurrently and have no need for virtualization;
• Run applications that are single-process, multiple-thread;
• Have an aversion to spending the time, dollars, and staff-hours needed to apply for access to these National Lab machines;
• Do not want to wait weeks on end in a queue to run their application;
• Are willing to optimize their applications for the target hardware to get the most out of the run;
• Ultimately want unencumbered 24/7 access to an affordable shared-memory machine – just like their MPI cluster.
Enter Symmetric Computing
The design team of Symmetric Computing came out of the research community. As such, they were very aware of the problems researchers face today and will face in the future. This awareness drove the development of DSMP and the decision to base it on commodity hardware. Our intent is nothing short of having DSMP do for shared-memory supercomputing what the Beowulf project (MPI) did for cluster computing.
How DSMP works
As stated in the introduction, DSMP is software that transforms an InfiniBand-connected cluster of homogeneous 1U/4P commodity servers into a shared-memory supercomputer. Although there are two unique kernels (a host kernel and a micro-kernel), for this discussion we will ignore the difference between them because, from the programmer's perspective, there is only one OS image and one kernel. The DSMP kernel provides seven (7) enhancements that transform a cluster into a distributed symmetric multiprocessing platform:
1. The shared-memory system;
2. The optimized InfiniBand driver, which supports a shared-memory architecture;
3. An application-driven memory-page coherency scheme;
4. An enhanced multi-threading service, based on the POSIX thread standard;
5. A distributed MuTeX;
6. A memory-based distributed disk-queue; and
7. A distributed disk array.
The shared-memory system: The centerpiece of DSMP is its shared-memory architecture. For our example we will assume a three-node 4P system with 64GB of physical memory per node. The three nodes are networked via 40Gb InfiniBand, and there is no switch. This is, in fact, our value Treo™ Departmental Supercomputer product offering.

Figure 2 presents a macro view of the DSMP memory architecture. What becomes quite obvious from viewing this graphic is the presence of two memory segments: local memory and global memory.

[Figure 2 – DSMP memory architecture. Each of the three SMP nodes (SMP 0, SMP 1, SMP n) holds 64GB of physical memory, split between a 16GB local-memory segment and a global-memory segment ("Global Memory 0", "1", "3"), distributed over processors P0–P3; the nodes exchange memory pages over InfiniBand TX/RX links.]
Both coexist in the SMP physical memory and are evenly distributed over the four AMD64 processors in each of the three servers. However, the memory management unit (MMU) on the AMD Opteron™ processor sees only the local memory (noted in blue). Local memory is statically allocated by the kernel; for our Treo™ example we will assume 1GB of local memory for every AMD64 core within the server. Hence, there are 16GB of local memory per server, or 48GB of local memory allocated from the 192GB of available system-wide memory. The remaining 144GB is global memory, which is concurrently viewable and accessible by all 48 processor cores within the Treo™ Departmental Supercomputer.

All memory (local and global) is partitioned into 4,096-byte pages, or 64 AMD64 cache-lines. When there is a cache-line miss from local memory (a page fault), the kernel identifies a least-recently-used (LRU) memory page and swaps in the missing memory page from global memory. That happens, across the InfiniBand fabric, in just under 5 microseconds – even faster if the page is on the same physical node.
The optimized InfiniBand drivers: The entire success of DSMP revolves around the existence of a low-latency, commercially available network fabric. It wasn't that long ago, with the exit of Intel from InfiniBand, that industry experts were forecasting its demise. Today, InfiniBand is the fabric of choice for most High Performance Computing (HPC) clusters due to its low latency and high bandwidth.

To squeeze every last nanosecond of performance out of the fabric, the designer of DSMP bypassed the Linux InfiniBand protocol stack and wrote his own low-level driver. In addition, he developed a set of drivers that leverage the native RDMA capabilities of the InfiniBand host channel adapter (HCA). This allows the HCA to service and move memory-page requests without processor intervention. Hence, RDMA eliminates the overhead of message construction and deconstruction, reducing system-wide latency.
An application-driven, memory-page coherency scheme: As stated in the introduction, proprietary supercomputers maintain memory consistency and/or coherency via hardware extensions of the host processor. DSMP takes a different approach, maintaining two separate levels of coherency within the system. First, there is cache-line coherency within the local SMP server. Coherency at this level is maintained by the MMU and the SMP logic native to the AMD64 processor, i.e., Cache-coherent HyperTransport™ Technology. Global memory-page coherency and consistency, however, is controlled and maintained by the programmer. This approach may seem counter-intuitive at first. However, the target market segment for DSMP is technical computing, not enterprise, and it is assumed that the end user is familiar with the algorithm and how to optimize it for the target platform (in the same way code was optimized for a Beowulf cluster). The high skill level of the end users, combined with the need to use only commodity hardware, drove system-level design decisions that keep a DSMP cluster both affordable and fast. To meet these goals, new and enhanced Linux primitives were developed. Hence, with some simple, intuitive programming rules, augmented with the new primitives, porting an application to a DSMP platform (while maintaining coherency) is simple and manageable. The rules are as follows:
• Be sensitive to the fact that memory pages are swapped into and out of local memory from global memory in 4K pages, and that it takes 5 microseconds to complete a swap.
• Be careful not to overlap or allocate multiple data-sets within the same memory page. To help prevent this, a new Alloc( ) primitive is provided to assure alignment.
• Because of the way local and global memory are partitioned (within physical memory), care should be taken to distribute processes/threads and their associated data evenly over the four processors. In short, try not to pile processes/threads onto one processor/memory unit; rather, distribute them evenly over the system. POSIX thread primitives are provided to support this distribution.
• If a data-set is "modified-shared" and accessed by multiple processes/threads on an adjacent server, then it will be necessary to use a set of new Linux primitives to maintain coherency, i.e., Sync( ), Lock( ) and Release( ).
Multi-threading: The "gold standard" for parallelizing Linux C/C++ source code is the POSIX thread library, or Pthreads. POSIX is an acronym for Portable Operating System Interface. The latest version, POSIX.1 (IEEE Std 1003.1, 2004 Edition), was developed by the Austin Common Standards Revision Group (CSRG). To ensure that Pthreads would work with DSMP, each of the two dozen or so POSIX routines was tested and/or modified for DSMP and the Treo™ platform.
The common method for parallelizing a process is via the Fork( ) primitive. Within DSMP there is a flag associated with Fork( ). This flag determines whether the forked thread is to stay local (with the current process on the primary server) or run on one of the remote servers. This allows the programmer to specify how many threads of a given process can be serviced by the head node. Simple analysis will show how many threads can run concurrently before performance flattens out due to the memory-wall effect or other conditions. Once this value is understood, the remote flag can be used to evenly distribute threads over all the servers within the DSMP system. By default, each successive instance of Fork( ) causes that thread to be associated with the next server in the DSMP system, in round-robin fashion. Hence, a remote Fork( ) of three threads on Treo™ would place the current process on each of the three servers with one thread per server. The kernel manages the consistency of the process to ensure it executes with the same environment and associated state variables.
Coherency at the memory-page level is the responsibility of the programmer. Much of this is common sense: if a memory page is accessed by multiple threads and updated (modified-exclusive), then it is necessary to hold off pending threads until the current thread has updated the page in question. To facilitate this, three DSMP Linux primitives are provided: Sync( ), Lock( ) and Release( ).
• Sync( ): as the name implies, synchronizes one (1) local private memory page with its source global-memory page.
• Lock( ): prevents any other process thread from accessing and subsequently modifying the memory page. Lock( ) also invalidates all other copies of the locked memory page within the system. If a process thread on an adjacent server accesses a locked memory page, its execution is suspended until the page is released.
• Release( ): as the name implies, releases a previously locked memory page.
Lastly, to ensure that data structures do not overlap, a new DSMP Alloc( ) primitive is provided to force alignment of a given data structure on a 4K boundary. This primitive assures that the end of one data structure does not fall inside an adjacent data structure.
Distributed MuTeX: Wikipedia describes MuTeX, or mutual exclusion, as a set of algorithms used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable or a critical section. A distributed MuTeX is nothing more than a DSMP kernel enhancement which ensures that MuTeX functions as expected within the DSMP system. From a programmer's point of view, there are no changes or modifications to MuTeX – it just works.
Memory-based distributed disk-queue: A new DSMP primitive, D_file( ), provides a high-bandwidth, low-latency elastic queue for data that is intended to be written to a low-bandwidth interface, such as a hard disk drive (HDD) or the network. This distributed input/output queue is a memory (DRAM) based storage buffer which effectively eliminates the bottlenecks that occur when multiple threads compete for a low-bandwidth device such as an HDD. Once the current process retires, the contents of the queue are sent to the target I/O device and the queue is released.
A distributed disk array: A distributed disk array is implemented by the kernel through enhancements made to the Linux striped volume manager. These enhancements extend the Linux volume manager over the entire network, presenting a single consolidated drive to the OS. On Treo™ the distributed disk array is made up of six (6) 1TB drives – two per server – forming a single 6TB storage device.
DSMP Performance
The performance of a supercomputer is a function of two metrics:
1) Processor performance (computational throughput);
2) Global memory read/write performance, which can be further divided into:
   a. Stream performance – continuous R/W memory bandwidth; and
   b. Random read/write performance (memory R/W latency).
The extraordinary thing about DSMP™ is that it is based on commodity components. That's important, because DSMP performance scales with the performance of the commodity components from which it is made. As an example, random read/write latency for Treo™ went down 40% with the availability of 40Gb InfiniBand. Furthermore, this move from a 20Gb to a 40Gb fabric caused no appreciable increase in the cost of a Treo™ system (and no changes to the DSMP software were needed).
Also, within this same timeframe, AMD64 processor density went from quad-core to six-core, again without any appreciable increase in the cost of the total system. Therefore, over time, the performance gap between DSMP™ shared-memory supercomputers and proprietary shared-memory systems will close.

Today, proprietary shared-memory system providers quote intra-node bandwidth numbers on the order of 2.5GB/sec and random access times on the order of 1µsec. That's a difference of ~4:1 in bandwidth and ~5:1 in R/W latency over DSMP™. At first glance, this much disparity might appear to be a disadvantage, but that is not necessarily the case, for three reasons. First, DSMP random R/W latency is based on the time it takes to move 4,096B, versus the 64B or 128B moved in <1µsec by SGI and others; that's a 64:1 or 32:1 difference in the size of the cache-line or page. In addition, the processors used in these proprietary systems might have enhanced floating-point capabilities, but they might run slower – in some cases much slower – than a 2.8GHz quad-core AMD Opteron™ Processor. So performance is not tied entirely to memory latency or processor performance; it is a function of many system variables, as well as the algorithm and the way the data is structured.
A second and more important reason why DSMP performance is not a problem is access – that is, having open and unencumbered 24/7 access to a shared-memory system. As an example, let's assume it takes 24 hours to run a job on the ORNL Jaguar supercomputer with an allocation of 48 processors and 150GB of shared memory. However, it takes months to submit the proposal and gain approval. Then there is an additional wait in the queue of around 14 days to access the system – typical for this type of engagement. If we assume the DSMP™ shared-memory supercomputer has 1/5 the performance of the one at Oak Ridge (due to memory latency, bandwidth, and related factors), then it would take five times longer to get the same results – that's 120 hours versus 24. However, when you take into account the two-week queue time, the results are available 10 days sooner. In the same time-frame, you could have run the job three times over.
The third and final reason is value. Today, an entry-level Treo™ departmental supercomputer costs only $49,950, configured with 144GB of shared memory, 48 2.8GHz AMD64 processor cores, and 6TB of disk storage (university pricing). A comparable shared-memory platform from an OEM would approach $1,000,000 (not including maintenance and licensing fees); that's 1/20 of the price at 1/5 the performance. With the introduction of the Treo™ departmental supercomputer, universities and researchers have a new option based on the same market forces that drove the emergence of the MPI cluster, i.e., commodity hardware, value and availability. Today, Symmetric Computing offers four unique configurations of Treo™, from 48 to 72 AMD64 cores and 144GB to 336GB of shared memory (see table below).
Treo™ P/N    | Quad-core 2.8GHz | Six-core 2.6GHz | 4GB PC5300 DIMMs | 8GB PC5300 DIMMs | Total Shared Memory
SCA161604-3  | 269 Giga-flops   | -               | 192 GB           | -                | 144GB
SCA241604-3  | -                | 374 Giga-flops  | 192 GB           | -                | 120GB
SCA241608-3  | -                | 374 Giga-flops  | -                | 384 GB           | 312GB
SCA161608-3  | 269 Giga-flops   | -               | -                | 384 GB           | 336GB
Looking forward to 1Q10, the Symmetric Computing engineering staff will introduce a 10-node blade center delivering 1.2 Tera-flops of peak throughput with 640GB or 1.28TB of system memory. In addition, we are working with our partners to deliver turn-key platforms tuned for application-specific missions – such as next-generation sequencing, HMMER, BLAST, etc.
Conclusion
Symmetric Computing's overall goal is to make supercomputing accessible and affordable to a broad range of end users. We believe that DSMP is to shared-memory computing what Beowulf/MPI was to distributed-memory computing. We are focused on delivering an affordable, commodity-based technical computing solution that services an entirely new market: the Departmental Supercomputer. Our initial focus is to provide open applications, optimized to run under DSMP and on Treo™, to accelerate scientific developments in Biosciences and Bioinformatics. We continue to expand our scope of applications and remain committed to delivering Supercomputing to the Masses.
About Symmetric Computing
Symmetric Computing is a Boston-based software company with offices at the Venture Development Center on the campus of the University of Massachusetts – Boston. We design software to accelerate the use and application of shared-memory computing systems for Bioinformatics, Oil & Gas, Post-Production Editing, Financial Analysis, and related fields. Symmetric Computing is dedicated to delivering standards-based, customer-focused technical computing solutions for users ranging from universities to enterprises. For more information, visit www.symmetriccomputing.com.