  1. A Survey of Clustered Parallel File Systems for High Performance Computing Clusters
     James W. Barker, Ph.D.
     Computer, Computational and Statistical Sciences Division, Los Alamos National Laboratory
  2. Definition of Terms
     ● Distributed File System - The generic term for a client/server or "network" file system where the data is not locally attached to a host.
       ● Network File System (NFS) is the most common distributed file system currently in use.
     ● Storage Area Network (SAN) File System - Provides a means for hosts to share Fibre Channel storage, which is traditionally separated into private physical areas bound to different hosts. A block-level metadata manager manages access to different SAN devices. A SAN file system mounts storage natively on only one node and connects all other nodes to that storage by distributing the block address of that storage to all other nodes.
       ● Scalability is often an issue due to the significant workload required of the metadata managers and the large network transactions required in order to access data.
       ● Examples include IBM's General Parallel File System (GPFS) and Sistina (now Red Hat) Global File System (GFS).
  3. Definition of Terms
     ● Symmetric File Systems - A symmetric file system is one in which the clients also host the metadata manager code, resulting in all nodes understanding the disk structures.
       ● A concern with these systems is the burden that metadata management places on the client node, serving both itself and other nodes, which can impact the ability of the client node to perform its intended computational jobs.
       ● Examples include IBM's GPFS and Red Hat GFS.
     ● Asymmetric File Systems - An asymmetric file system is a file system in which there are one or more dedicated metadata managers that maintain the file system and its associated disk structures.
       ● Examples include Panasas ActiveScale, Lustre, and traditional NFS file systems.
  4. Definition of Terms
     ● Cluster File System - A distributed file system that is not a single server with a set of clients, but a cluster of servers that all work together to provide high-performance storage service to their clients.
       ● To the clients the cluster file system is transparent: it is simply "the file system", while the file system software manages distributing requests to elements of the storage cluster.
       ● Examples include the Hewlett-Packard Tru64 cluster and Panasas ActiveScale.
     ● Parallel File System - A parallel file system is one in which data blocks are striped, in parallel, across multiple storage devices on multiple storage servers. Support for parallel applications is provided, allowing all nodes to access the same files at the same time and thus providing concurrent read and write capabilities.
       ● Network link aggregation, another parallel file system technique and the one used by PVFS2, spreads I/O across several network connections in parallel, with each packet taking a different link path from the previous packet.
       ● Examples include Panasas ActiveScale, Lustre, PVFS2, GPFS, and GFS.
  5. Definition of Terms
     ● An important note: all of the above definitions overlap. A SAN file system can be symmetric or asymmetric. Its servers may be clustered or single servers. And it may support parallel applications or it may not.
       ● For example, the Panasas Storage Cluster and its ActiveScale File System (a.k.a. PanFS) is a clustered (many servers share the work), asymmetric (metadata management does not occur on the clients), parallel (supports concurrent reads and writes), object-based (not block-based), distributed (clients access storage via the network) file system.
       ● Another example: the Lustre File System is also a clustered, asymmetric, parallel, object-based (the objects are referred to as targets by Lustre), distributed file system.
       ● Another example: the Parallel Virtual File System 2 (PVFS2) is a clustered, symmetric, parallel, aggregation-based, distributed file system.
       ● And finally, the Red Hat Global File System (GFS) is a clustered, symmetric, parallel, block-based, distributed file system.
  6. Object Storage Components
     ● An Object contains the data and enough additional information to allow the data to be autonomous and self-managing.
     ● An Object-based Storage Device (OSD) is an intelligent evolution of the disk drive, capable of storing and serving objects rather than simply copying data to tracks and sectors. (The term OSD does not exist in Lustre.)
       ● The Panasas term OSD corresponds to the Lustre term OST.
       ● An Object-based Storage Target (OST) is an abstraction layer above the physical blocks of a physical disk (in Panasas terminology, not in Lustre).
       ● An Object-Based Disk (OBD) is an abstraction of the physical blocks of the physical disks (in Lustre terminology; OBDs do not exist in Panasas terminology).
     ● An Installable File System (IFS) integrates with compute nodes, accepts POSIX file system commands and data from the operating system, addresses the OSDs directly, and stripes the objects across multiple OSDs.
     ● A Metadata Server mediates among multiple compute nodes in the environment, allowing them to share data while maintaining cache consistency on all nodes.
     ● The Network Fabric ties the compute nodes to the OSDs and metadata servers.
  7. Storage Objects
     ● Each file or directory can be thought of as an object. As with all objects, storage objects have attributes.
     ● Each storage object attribute can be assigned a value such as file type, file location, whether the data is striped or not, ownership, and permissions.
       ● An object storage device (OSD) allows us to specify, for each file, where to store the blocks allocated to the file, via a metadata server and object storage targets.
     ● Extending the storage attributes further, they can also specify how many object storage targets to stripe onto and what level of redundancy to employ on the target.
       ● Some implementations (Panasas) allow the specification of RAID 0 (striped) or RAID 1 (mirrored) on a per-file basis.
  8. Panasas
     ● Within the storage device, all objects are accessed via a 96-bit object ID. The object is accessed based on the object ID, the beginning of the range of bytes inside the object, and the length of the byte range that is of interest (<objectID, offset, length>).
     ● There are three different types of objects:
       ● The "Root" object on the storage device identifies the storage device and various attributes of the device, including total capacity and available capacity.
       ● A "Group" object provides a "directory" to a logical subset of the objects on the storage device.
       ● A "User" object contains the actual application data to be stored.
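     As a small illustration of the access model (not the Panasas or T10 OSD command set), the C sketch below packages the <objectID, offset, length> triple into a structure; the type names and example values are invented for this sketch.

        #include <stdint.h>
        #include <stdio.h>

        /* Together these two fields form the 96-bit object ID. */
        typedef struct {
            uint32_t id_hi;   /* upper 32 bits */
            uint64_t id_lo;   /* lower 64 bits */
        } object_id_t;

        /* A request names the object, where the byte range of interest
         * begins, and how many bytes it covers: <objectID, offset, length>. */
        typedef struct {
            object_id_t oid;
            uint64_t    offset;
            uint64_t    length;
        } osd_request_t;

        int main(void)
        {
            /* Example: read 64 KiB starting 4 KiB into a made-up user object. */
            osd_request_t req = { { 0x1u, 0x2aULL }, 4096, 65536 };

            printf("object %08llx%016llx, offset %llu, length %llu\n",
                   (unsigned long long)req.oid.id_hi,
                   (unsigned long long)req.oid.id_lo,
                   (unsigned long long)req.offset,
                   (unsigned long long)req.length);
            return 0;
        }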
  9. Panasas
     ● The "User" object is a container for data and two types of attributes:
       ● Application Data is essentially the equivalent of the data that a file would normally have in a conventional file system. It is accessed with file-like commands such as Open, Close, Read, and Write.
       ● Storage Attributes are used by the storage device to manage the block allocation for the data. These include the object ID, block pointers, logical length, and capacity used, and are similar to the inode-level attributes inside a traditional file system.
       ● User Attributes are opaque to the storage device and are used by applications and metadata managers to store higher-level information about the object.
         ● These attributes can include file system attributes such as ownership and access control lists (ACLs), Quality of Service requirements that apply to a specific object, and how the storage system treats a specific object (i.e., what level of RAID to apply, the size of the user's quota, or the performance characteristics required for that data).
  10. Panasas
     ● The Panasas concept of object storage is implemented entirely in hardware.
     ● The Panasas ActiveScale File System supports two modes of data access:
       ● DirectFLOW is an out-of-band solution enabling Linux cluster nodes to directly access data on StorageBlades in parallel.
       ● NFS/CIFS operates in band, utilizing the DirectorBlades as a gateway between NFS/CIFS clients and StorageBlades.
  11. Panasas Performance
     ● Random I/O - On SPECsfs97_R1.v3, as measured by the Standard Performance Evaluation Corporation (www.spec.org), a Panasas ActiveScale storage cluster produced a peak of 305,805 random I/O operations per second.
     ● Data Throughput - As measured "in-house" by Panasas on a similarly configured cluster, sequential I/O read tests delivered a sustained 10.1 GBytes/second.
  12. Lustre
     ● Lustre is an open, standards-based technology that runs on commodity hardware and uses object-based disks for storage and metadata servers for file system metadata.
       ● This design provides an efficient division of labor between computing and storage resources.
     ● Replicated, failover MetaData Servers (MDSs) maintain a transactional record of high-level file and file system changes.
     ● Distributed Object Storage Targets (OSTs) are responsible for actual file system I/O and for interfacing with storage devices.
       ● File operations bypass the metadata server completely and utilize the parallel data paths to all OSTs in the cluster.
     ● Lustre's approach of separating metadata operations from data operations results in enhanced performance.
       ● The division of metadata and data operations creates a scalable file system with greater recoverability from failure conditions by providing the advantages of both journaling and distributed file systems.
  13. Lustre
     ● Lustre supports strong file and metadata locking semantics to maintain coherency of the file systems even under a high volume of concurrent access.
     ● File locking is distributed across the Object Storage Targets (OSTs) that constitute the file system, with each OST managing locks for the objects that it stores.
     ● Lustre uses an open networking stack composed of three layers:
       ● At the top of the stack is the Lustre request processing layer.
       ● Beneath the Lustre request processing layer is the Portals API developed by Sandia National Laboratories.
       ● At the bottom of the stack is the Network Abstraction Layer (NAL), which is intended to provide out-of-the-box support for multiple types of networks.
  14. Lustre
     ● Lustre provides security in the form of authentication, authorization, and privacy by leveraging existing security systems.
       ● This eases incorporation of Lustre into existing enterprise security environments without requiring changes to Lustre.
     ● Similarly, Lustre leverages the underlying journaling file systems provided by Linux.
       ● These journaling file systems enable persistent state recovery, providing resiliency and recoverability from failed OSTs.
     ● Finally, Lustre's configuration and state information is recorded and managed using open standards such as XML and LDAP.
       ● This eases the task of integrating Lustre into existing environments or third-party tools.
  15. Lustre
     ● Lustre technology is designed to scale while maintaining resiliency.
       ● As servers are added to a typical cluster environment, failures become more likely due to the increasing number of physical components.
       ● Lustre's support for resilient, redundant hardware provides protection from inevitable hardware failures through transparent failover and recovery.
  16. Lustre File System Abstractions
     ● The Lustre file system provides several abstractions designed to improve both performance and scalability.
       ● At the file system level, Lustre treats files as objects that are located through Metadata Servers (MDSs).
       ● Metadata Servers support all file system namespace operations:
         ● These operations include file lookups, file creation, and file and directory attribute manipulation, as well as directing actual file I/O requests to Object Storage Targets (OSTs), which manage the storage that is physically located on underlying Object-Based Disks (OBDs).
         ● Metadata servers maintain a transactional record of file system metadata changes and cluster status, as well as supporting failover operations.
  17. Lustre Inodes, OSTs & OBDs
     ● Like traditional file systems, the Lustre file system has a unique inode for every regular file, directory, symbolic link, and special file.
       ● Creating a new file causes the client to contact a metadata server, which creates an inode for the file and then contacts the OSTs to create objects that will actually hold file data.
       ● Metadata for the objects is held in the inode as extended attributes for the file.
       ● The objects allocated on OSTs hold the data associated with the file and can be striped across several OSTs in a RAID pattern.
       ● Within the OST, data is actually read and written to underlying storage known as Object-Based Disks (OBDs).
       ● Subsequent I/O to the newly created file is done directly between the client and the OST, which interacts with the underlying OBDs to read and write data.
       ● The metadata server is only updated when additional namespace changes associated with the new file are required.
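     The division of labor above can be summarized in a small, compilable toy sketch. Every type and function name below is hypothetical shorthand for the protocol steps (client to MDS for the namespace operation, object allocation on OSTs, then direct client-to-OST data transfers); it is not the real Lustre client API, and the striping detail is deliberately simplified.

        #include <stdio.h>

        typedef struct { int ino; int stripe_count; } lu_inode;     /* stand-in for a Lustre inode */
        typedef struct { int ost_index; int object_id; } lu_object; /* stand-in for an OST object  */

        /* Step 1: namespace operation, handled by a metadata server (MDS). */
        static lu_inode mds_create_inode(const char *path, int stripes)
        {
            printf("MDS: create inode for %s with stripe count %d\n", path, stripes);
            return (lu_inode){ .ino = 42, .stripe_count = stripes };
        }

        /* Step 2: the OSTs allocate the objects that will hold file data;
         * their identities are recorded in the inode as extended attributes. */
        static void ost_create_objects(const lu_inode *ino, lu_object objs[])
        {
            for (int i = 0; i < ino->stripe_count; i++) {
                objs[i] = (lu_object){ .ost_index = i, .object_id = 1000 + i };
                printf("OST %d: created object %d (layout stored as xattr of inode %d)\n",
                       i, objs[i].object_id, ino->ino);
            }
        }

        /* Step 3: subsequent I/O flows directly between the client and the OSTs
         * (which write to their underlying OBDs); the MDS is off the data path. */
        static void client_write(const lu_object objs[], int nobjs, size_t nbytes)
        {
            for (int i = 0; i < nobjs; i++)
                printf("client -> OST %d: write %zu bytes to object %d\n",
                       objs[i].ost_index, nbytes / nobjs, objs[i].object_id);
        }

        int main(void)
        {
            lu_object objs[2];
            lu_inode ino = mds_create_inode("/lustre/scratch/result.dat", 2);
            ost_create_objects(&ino, objs);
            client_write(objs, ino.stripe_count, 8192);
            return 0;
        }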
  18. Lustre Network Independence
     ● Lustre can be used over a wide variety of networks due to its use of an open Network Abstraction Layer. Lustre is currently in use over TCP and Quadrics (QSWNet) networks.
       ● Myrinet, Fibre Channel, Stargen, and InfiniBand support are under development.
       ● Lustre's network neutrality enables Lustre to quickly take advantage of performance improvements provided by network hardware and protocol improvements offered by new systems.
     ● Lustre provides unique support for heterogeneous networks.
       ● For example, it is possible to connect some clients to the MDS and OST servers over Ethernet and others over a QSW network, in a single installation.
  19. Lustre
     ● One drawback to Lustre is that a Lustre client cannot run on a server that is providing OSTs.
     ● Lustre has not been ported to UNIX or Windows operating systems.
       ● Lustre clients can and probably will be implemented on non-Linux platforms, but as of this date, Lustre is available only on Linux.
  20. Lustre Performance
     ● Hewlett-Packard (HP) and Pacific Northwest National Laboratory (PNNL) have partnered on the design, installation, integration, and support of one of the top 10 fastest computing clusters in the world.
     ● The HP Linux super cluster, with more than 1,800 Itanium® 2 processors, is rated at more than 11 TFLOPS.
     ● PNNL has run Lustre for more than a year and currently sustains over 3.2 GB/s of bandwidth running production loads on a 53-terabyte Lustre-based file share.
       ● Individual Linux clients are able to write data to the parallel Lustre servers at more than 650 MB/s.
  21. Lustre Summary
     ● Lustre is a storage architecture and distributed file system that provides significant performance, scalability, and flexibility to computing clusters.
     ● Lustre uses an object storage model for file I/O and storage management to provide an efficient division of labor between computing and storage resources.
       ● Replicated, failover Metadata Servers (MDSs) maintain a transactional record of high-level file and file system changes.
       ● Distributed Object Storage Targets (OSTs) are responsible for actual file system I/O and for interfacing with local or networked storage devices known as Object-Based Disks (OBDs).
     ● Lustre leverages open standards such as Linux, XML, LDAP, readily available open source libraries, and existing file systems to provide a scalable, reliable distributed file system.
     ● Lustre uses failover, replication, and recovery techniques to minimize downtime and to maximize file system availability, thereby maximizing cluster productivity.
  22. Storage Aggregation
     ● Rather than providing scalable performance by striping data across dedicated storage devices, storage aggregation provides scalable capacity by utilizing available storage blocks on each compute node.
     ● Each compute node runs a server daemon that provides access to free space on the local disks.
       ● Additional software runs on each client node that combines those available blocks into a virtual device and provides locking and concurrent access to the other compute nodes.
       ● Each compute node could potentially be both a server of blocks and a client. Using storage aggregation on a large (>1,000-node) cluster, tens of TB of free storage could potentially be made available for use as high-performance temporary space.
  23. Parallel Virtual File System (PVFS2)
     ● Parallel Virtual File System 2 (PVFS2) is an open source project from Clemson University that provides a lightweight server daemon giving simultaneous access to storage devices from hundreds to thousands of clients.
     ● Each node in the cluster can be a server, a client, or both.
     ● Since storage servers can also be clients, PVFS2 supports striping data across all available storage devices in the cluster (i.e., storage aggregation).
       ● PVFS2 is best suited for providing large, fast temporary storage.
  24. Parallel Virtual File System (PVFS2)
     ● Implicitly maintains consistency by carefully structuring metadata and the namespace.
     ● Uses relaxed semantics, defining the semantics of data access so that they can be achieved without locking.
  25. Parallel Virtual File System (PVFS2)
     ● PVFS2 shows that it is possible to build a parallel file system that implicitly maintains consistency by carefully structuring the metadata and name space and by defining the semantics of data access that can be achieved without locking.
     ● This design leads to file system behavior that some traditional applications do not expect.
       ● These relaxed semantics are not new in the field of parallel I/O. PVFS2 closely implements the semantics dictated by MPI-IO.
  26. Parallel Virtual File System (PVFS2)
     ● PVFS2 also has native support for flexible noncontiguous data access patterns.
     ● For example, imagine an application that reads a column of elements out of an array. To retrieve this data, the application might issue a large number of small and scattered reads to the file system.
     ● However, if it could ask the file system for all of the noncontiguous elements in a single operation, both the file system and the application could perform more efficiently.
  27. PVFS2 Stateless Architecture
     ● PVFS2 is designed around a stateless architecture.
       ● PVFS2 servers do not keep track of typical file system bookkeeping information such as which files have been opened, file positions, and so on.
       ● There is also no shared lock state to manage.
     ● The major advantage of a stateless architecture is that clients can fail and resume without disturbing the system as a whole.
     ● It also allows PVFS2 to scale to hundreds of servers and thousands of clients without being impacted by the overhead and complexity of tracking file state or locking information associated with these clients.
  28. PVFS2 Design Choices
     ● These design choices enable PVFS2 to perform well in a parallel environment, but not so well if treated as a local file system.
       ● Without client-side caching of metadata, status operations typically take a long time, as the information is retrieved over the network. This can make programs like "ls" take longer to complete than might be expected.
     ● PVFS2 is better suited for I/O-intensive applications than for hosting a home directory.
       ● PVFS2 is optimized for efficient reading and writing of large amounts of data, and thus is very well suited for scientific applications.
  29. PVFS2 Components
     ● The basic PVFS2 package consists of three components: a server, a client, and a kernel module.
       ● The server runs on nodes that store either file system data or metadata.
       ● The client and the kernel module are used by nodes that actively store or retrieve data (or metadata) from the PVFS2 servers.
     ● Unlike the original PVFS, each PVFS2 server can operate as a data server, a metadata server, or both simultaneously.
  30. Accessing PVFS2 File Systems
     ● Two methods are provided for accessing PVFS2 file systems.
       ● The first is to mount the PVFS2 file system. This lets the user change and list directories, move files, and execute binaries from the file system.
         ● This mechanism introduces some performance overhead but is the most convenient way to access the file system interactively.
       ● Scientific applications use the second method, MPI-IO.
         ● The MPI-IO interface helps optimize access to single files by many processes on different nodes. It also provides "noncontiguous" access operations that allow for efficient access to data spread throughout the file.
         ● For a strided pattern such as the column read described earlier, this is done by asking for every eighth element, starting at offset 0 and ending at offset 56, all as one file system operation (see the sketch below).
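     A minimal MPI-IO sketch of that request follows, assuming an array of doubles in a hypothetical file on a PVFS2 mount: a strided datatype selects element offsets 0, 8, ..., 56, and one collective read returns all eight values as a single operation. The calls are standard MPI-2 I/O, not PVFS2-specific.

        #include <mpi.h>

        int main(int argc, char **argv)
        {
            MPI_File     fh;
            MPI_Datatype strided;
            double       buf[8];

            MPI_Init(&argc, &argv);

            /* 8 blocks of 1 double, stride 8: element offsets 0, 8, ..., 56. */
            MPI_Type_vector(8, 1, 8, MPI_DOUBLE, &strided);
            MPI_Type_commit(&strided);

            /* Hypothetical path to a file on a mounted PVFS2 volume. */
            MPI_File_open(MPI_COMM_WORLD, "/mnt/pvfs2/array.dat",
                          MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

            /* The file view hides the gaps, so one collective read returns
             * all eight scattered elements in a single request. */
            MPI_File_set_view(fh, 0, MPI_DOUBLE, strided, "native", MPI_INFO_NULL);
            MPI_File_read_all(fh, buf, 8, MPI_DOUBLE, MPI_STATUS_IGNORE);

            MPI_File_close(&fh);
            MPI_Type_free(&strided);
            MPI_Finalize();
            return 0;
        }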
  31. PVFS2 Summary
     ● There is no single file system that is the perfect solution for every I/O workload, and PVFS2 is no exception.
     ● High-performance applications rely on a different set of features to access data than those provided by typical networked file systems.
       ● PVFS2 is best suited for I/O-intensive applications.
       ● PVFS2 was not intended for home directories, but as a separate, fast, scalable file system it is very capable.
  32. Red Hat Global File System
     ● Red Hat Global File System (GFS) is an open source, POSIX-compliant cluster file system.
     ● Red Hat GFS executes on Red Hat Enterprise Linux servers attached to a storage area network (SAN).
       ● GFS runs on all major server and storage platforms supported by Red Hat.
     ● Allows simultaneous reading and writing of blocks to a single shared file system on a SAN.
       ● GFS can be configured without any single points of failure.
       ● GFS can scale to hundreds of Red Hat Enterprise Linux servers.
       ● GFS is compatible with all standard Linux applications.
     ● Supports direct I/O by databases.
       ● Improves database performance by avoiding traditional file system overhead.
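     For a sense of what database-style direct I/O looks like at the system-call level, the sketch below opens a file with O_DIRECT and performs one aligned, uncached read. This is generic Linux direct I/O rather than a GFS-specific interface; the file path and the 4 KiB alignment are assumptions.

        #define _GNU_SOURCE          /* for O_DIRECT on Linux */
        #include <fcntl.h>
        #include <stdlib.h>
        #include <unistd.h>

        int main(void)
        {
            /* Hypothetical data file on a GFS mount. */
            int fd = open("/mnt/gfs/db/tablespace.dat", O_RDWR | O_DIRECT);
            if (fd < 0)
                return 1;

            /* O_DIRECT requires the buffer, offset, and length to be aligned;
             * 4096 bytes is a common requirement (it is device dependent). */
            void *buf;
            if (posix_memalign(&buf, 4096, 4096) != 0) {
                close(fd);
                return 1;
            }

            /* Read 4 KiB at offset 0, bypassing the page cache. */
            ssize_t n = pread(fd, buf, 4096, 0);
            (void)n;

            free(buf);
            close(fd);
            return 0;
        }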
  33. Red Hat Global File System
     ● Red Hat Enterprise Linux allows organizations to utilize the default Linux file system, Ext3 (third extended file system), NFS (Network File System), or Red Hat's GFS cluster file system.
       ● Ext3 is a journaling file system, which uses log files to preserve the integrity of the file system in the event of a sudden failure. It is the standard file system used by all Red Hat Enterprise Linux systems.
       ● NFS is the de facto standard approach to accessing files across the network.
       ● GFS (Global File System) allows multiple servers to share access to the same files on a SAN while managing that access to avoid conflicts.
         ● Sistina Software, the original developer of GFS, was acquired by Red Hat at the end of 2003. Subsequently, Red Hat contributed GFS to the open source community under the GPL license.
         ● GFS is provided as a fully supported, optional layered product for Red Hat Enterprise Linux systems.
  34. GFS Logical Volume Manager
     ● Red Hat Enterprise Linux includes the Logical Volume Manager (LVM), which provides kernel-level storage virtualization capabilities. LVM supports combining physical storage elements into a collective storage pool, which can then be allocated and managed according to application requirements, without regard for the specifics of the underlying physical disk systems.
     ● LVM was initially developed by Sistina and is now part of the standard Linux kernel.
     ● LVM provides enterprise-level volume management capabilities that are consistent with those of the leading proprietary enterprise operating systems.
     ● LVM capabilities include:
       ● Storage performance and availability management, by allowing the addition and removal of physical devices and through dynamic disk volume resizing. Logical volumes can be resized dynamically online.
         ● Ext3 supports offline file system resizing (requiring unmount, resize, and mount operations).
       ● Disk system management that enables the upgrading of disks, removal of failing disks, reorganization of workloads, and adaptation of storage capacity to changing system needs.
  35. GFS Multi-Pathing
     ● Red Hat GFS works in concert with Red Hat Cluster Suite to provide failover of critical computing components for high availability.
     ● Multi-path access to storage is essential to continued availability in the event of path failure (such as failure of a Host Bus Adapter).
     ● Red Hat Enterprise Linux's multi-path device driver (MD driver) recognizes multiple paths to the same device, eliminating the problem of the system assuming each path leads to a different disk.
       ● The MD driver combines the paths to a single disk, enabling failover to an alternate path if one path is disrupted.
  36. GFS Enterprise Storage Options
     ● Although SAN and NAS have emerged as the preferred enterprise storage approach, direct attached storage remains widespread throughout the enterprise. Red Hat Enterprise Linux supports the full set of enterprise storage options:
       ● Direct attached storage
         ● SCSI
         ● ATA
         ● Serial ATA
         ● SAS (Serial Attached SCSI)
       ● Networked storage
         ● SAN (access to block-level data over Fibre Channel or IP networks)
         ● NAS (access to data at the file level over IP networks)
       ● Storage interconnects
         ● Fibre Channel (FC)
         ● iSCSI
         ● GNBD (Global Network Block Device)
         ● NFS
  37. GFS on SANs
     ● SANs provide direct block-level access to storage. When deploying a SAN with the Ext3 file system, each server mounts and accesses disk partitions individually; concurrent access is not possible. When a server shuts down or fails, the clustering software will fail over its disk partitions so that a remaining server can mount them and resume its tasks.
     ● Deploying GFS on SAN-connected servers allows full sharing of all file system data, concurrently. These two configuration topologies are shown in the diagram.
  38. GFS on NFS
     ● In general, an NFS file server, usually configured with local storage, serves file-level data across a network to remote NFS clients. This topology is best suited for non-shared data files (individual users' directories, for example) and is widely used in general-purpose computing environments.
     ● NFS configurations generally offer lower performance than block-based SAN environments, but they are configured using standard IP networking hardware and so offer excellent scalability. They are also considerably less expensive.
  39. GFS on iSCSI
     ● Combining the performance and sharing capabilities of a SAN environment with the scalability and cost effectiveness of a NAS environment is highly desirable.
     ● A topology that achieves this uses SAN technology to provide the core ("back end") physical disk infrastructure, and then uses block-level IP technology to distribute served data to its eventual consumer across the network.
     ● The emerging technology for delivering block-level data across a network is iSCSI.
       ● This has been developing slowly for a number of years, but as the necessary standards have stabilized, adoption by industry vendors has started to accelerate considerably.
       ● Red Hat Enterprise Linux currently supports iSCSI.
  40. GFS on GNBD
     ● As an alternative to iSCSI, Red Hat Enterprise Linux provides support for Red Hat's Global Network Block Device (GNBD) protocol, which allows block-level data to be accessed over TCP/IP networks.
     ● The combination of GNBD and GFS provides additional flexibility for sharing data on the SAN. This topology allows a GFS cluster to scale to hundreds of servers, which can concurrently mount a shared file system without the expense of including a Fibre Channel HBA and associated Fibre Channel switch port with every machine.
     ● GNBD can make SAN data available to many other systems on the network without the expense of a Fibre Channel SAN connection.
       ● Today, GNBD and iSCSI offer similar capabilities; however, GNBD is a mature technology while iSCSI is still relatively new.
       ● Red Hat provides GNBD as part of Red Hat Enterprise Linux so that customers can deploy IP network-based SANs today.
       ● As iSCSI matures, it is expected to supplant GNBD, offering better performance and a wider range of configuration options. An example configuration is shown in the diagram that follows.
  41. GFS Summary
     ● Enterprises can now deploy large sets of open source, commodity servers in a horizontal scalability strategy and achieve the same levels of processing power for far less cost.
     ● Such horizontal scalability can lead an organization toward utility computing, where server and storage resources are added as needed. Red Hat Enterprise Linux provides substantial server and storage flexibility: the ability to add and remove servers and storage and to redirect and reallocate storage resources dynamically.
  42. Summary
     ● Panasas is a clustered, asymmetric, parallel, object-based, distributed file system.
       ● Implements its file system entirely in hardware.
       ● Claims the highest sustained data rate of the four systems reviewed.
     ● Lustre is a clustered, asymmetric, parallel, object-based, distributed file system.
       ● An open, standards-based system.
       ● Great modularity and compatibility with interconnects, networking components, and storage hardware.
       ● Currently only available for Linux.
     ● Parallel Virtual File System 2 (PVFS2) is a clustered, symmetric, parallel, aggregation-based, distributed file system.
       ● Data access is achieved without file or metadata locking.
       ● PVFS2 is best suited for I/O-intensive (i.e., scientific) applications.
       ● PVFS2 could be used for high-performance scratch storage where data is copied and simulation results are written from thousands of cycles simultaneously.
     ● Red Hat Global File System (GFS) is a clustered, symmetric, parallel, block-based, distributed file system.
       ● An open, standards-based system.
       ● Great modularity and compatibility with interconnects, networking components, and storage hardware.
       ● A relatively low-cost, SAN-based technology.
       ● Only available on Red Hat Enterprise Linux.
  43. Conclusions
     ● No single clustered parallel file system can address the requirements of every environment.
     ● Hardware-based implementations have greater throughput than software-based implementations.
     ● Standards-based implementations exhibit greater modularity and flexibility in interoperating with third-party components and appear most open to the incorporation of new technology.
     ● All implementations appear to scale well into the range of thousands of clients, hundreds of servers, and hundreds of TB of storage.
     ● All implementations appear to address the issues of hardware and software redundancy, component failover, and avoidance of a single point of failure.
     ● All implementations exhibit the ability to take advantage of low-latency, high-bandwidth interconnects, thus avoiding the overhead associated with TCP/IP networking.
  44. Questions?
  45. References
     Panasas: http://www.panasas.com/docs/Object_Storage_Architecture_WP.pdf
     Lustre: http://www.lustre.org/docs/whitepaper.pdf
     A Next-Generation Parallel File System for Linux Clusters: http://www.pvfs.org/files/linuxworld-JAN2004-PVFS2.ps
     Red Hat Global File System: http://www.redhat.com/whitepapers/rha/gfs/GFS_INS0032US.pdf
     Red Hat Enterprise Linux: Creating a Scalable Open Source Storage Infrastructure: http://www.redhat.com/whitepapers/rhel/RHEL_creating_a_scalable_os_storage_infrastructure.pdf
     Exploring Clustered Parallel File Systems and Object Storage, by Michael Ewan: http://www.intel.com/cd/ids/developer/asmona/eng/238284.htm?prn=Y
