Survey of distributed storage system



These slides introduce DAS, NAS and SAN, then survey object storage, storage virtualization and distributed file systems.

Usage Rights

© All Rights Reserved

    Presentation Transcript

    • Survey of Distributed Storage System
    • Outline
      Storage Virtualization
      Object Storage
      Distributed File System
    • Background
      As more and more digital devices (e.g. PCs, laptops, iPads and smartphones) connect to the Internet, massive amounts of new data are created on the web
      There were 5 exabytes of data online in 2002, which had risen to 281 exabytes by 2009; online data is growing faster than Moore's Law
      How, then, can this massive volume of data be stored and managed effectively and efficiently?
      A natural approach: a distributed storage system!
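The growth comparison above can be checked with a few lines of arithmetic. The sketch below uses only the two figures quoted on the slide (5 EB in 2002, 281 EB in 2009); the two-year doubling period used for Moore's Law is an assumption added here for comparison, not a number from the slides.

```python
import math

# Figures quoted on the slide: exabytes of online data in 2002 and 2009.
START_EB, END_EB = 5.0, 281.0
YEARS = 2009 - 2002

# Assumption (not from the slides): Moore's Law taken as a doubling every 2 years.
MOORE_DOUBLING_YEARS = 2.0


def annual_growth_factor(start: float, end: float, years: float) -> float:
    """Compound annual factor that grows `start` into `end` over `years`."""
    return (end / start) ** (1.0 / years)


def doubling_time(factor: float) -> float:
    """Years needed to double at a given annual growth factor."""
    return math.log(2) / math.log(factor)


data_factor = annual_growth_factor(START_EB, END_EB, YEARS)
moore_factor = 2 ** (1.0 / MOORE_DOUBLING_YEARS)

print(f"online data: x{data_factor:.2f} per year, "
      f"doubling every {doubling_time(data_factor):.2f} years")
print(f"Moore's Law (assumed): x{moore_factor:.2f} per year")
```

Under that assumption the data volume doubles noticeably faster than transistor counts, which is the point the slide is making.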
    • Traditional Storage Architecture
      Direct Attached Storage (DAS)
      - huge management burden
      - limited number of connected hosts
      - severely limited data sharing
      Fabric Attached Storage (FAS)
      - central system serves data to connected hosts
      - hosts and devices interconnected through Ethernet or Fibre Channel
      - NAS & SAN
    • FAS Implementations
      Network Attached Storage (NAS)
      - file-based storage architecture
      - data sharing across platforms
      - file server can be the bottleneck
      Storage Area Networks (SAN)
      - scalable performance, high
      - limited ability to share data
      - unreliable security
      Since the traditional storage architectures cannot satisfy the emerging requirements well, novel approaches are needed!
    • Outline
      Storage Virtualization
      Object Storage
      Distributed File System
    • Storage Virtualization
      Definitions of storage virtualization by SNIA
      - The act of abstracting, hiding, or isolating the internal functions of a storage (sub)system or service from applications, computer servers, or general network resources for the purposes of enabling application and network independent management of storage or data
      - The application of virtualization to storage services or devices for the purpose of aggregating, hiding complexity, or adding new capabilities to lower-level storage resources
      Simply put, storage virtualization aggregates storage components, such as disks, controllers, and storage networks, in a coordinated way to share them more efficiently among the applications it serves!
    • Characteristics of an ideal solution
      A good storage virtualization solution should:
      Enhance the storage resources it is virtualizing through the aggregation of services to increase the return of existing assets
      Not add another level of complexity in configuration and
      Improve performance rather than act as a bottleneck, so that it remains scalable. Scalability is the capability of a system to keep performance scaling linearly as new resources (typically hardware) are added
      Provide secure multi-tenancy so that users and data can share virtual resources without exposure to other users’ bad behavior or mistakes
      Not be proprietary, but virtualize other vendor storage in
      the same way as its own storage to make the management
    • Types of Storage Virtualization
      Modern storage virtualization technologies can be implemented at three layers of the infrastructure:
      - In the server: some of the earliest forms of storage virtualization came from within the server’s operating system
      - In the storage network: network-based storage virtualization embeds the intelligence of managing the storage resources in the network layer
      - In the storage controller: controller-based storage virtualization allows external storage to appear as if it’s internal
    • Server-based
      • Server-based storage virtualization is highly configurable and flexible since it’s implemented in the system software.
      • Because most operating systems incorporate this functionality into their system software, it is very cheap.
      • It does not require additional hardware in the storage infrastructure, and works with any devices that can be seen by the operating system.
      • Although it helps maximize the efficiency and resilience of storage resources, it’s optimized on a per-server basis only.
      • The tasks of mirroring, striping, and calculating parity require additional processing, taking valuable CPU and memory resources away from the application.
      • Since every operating system implements file systems and volume management in different ways, organizations with multiple IT vendors need to maintain different skill sets and processes, with higher costs.
      • When it comes to the migration or replication of data (either locally or remotely) it becomes difficult to keep track of data protection across the entire environment.
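The striping and parity work mentioned in the list above can be illustrated with a small sketch. It is not any particular volume manager's implementation: the function names, the single XOR parity block per stripe, and the zero padding are assumptions made for illustration. It does show why the host CPU is involved: every stripe requires a byte-wise XOR over all data blocks.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_stripes(data: bytes, n_data_disks: int, block_size: int):
    """Split data into stripes of n_data_disks blocks plus one parity block.

    The last stripe is padded with zero bytes (an assumption of this sketch).
    """
    stripe_bytes = n_data_disks * block_size
    n_stripes = -(-len(data) // stripe_bytes)  # ceiling division
    padded = data.ljust(n_stripes * stripe_bytes, b"\0")
    stripes = []
    for start in range(0, len(padded), stripe_bytes):
        blocks = [
            padded[start + i * block_size: start + (i + 1) * block_size]
            for i in range(n_data_disks)
        ]
        stripes.append(blocks + [xor_blocks(blocks)])  # parity goes last
    return stripes


def rebuild(stripe, lost_index: int) -> bytes:
    """Recover one lost block of a stripe by XOR-ing all surviving blocks."""
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


stripes = make_stripes(b"hello distributed storage", n_data_disks=3, block_size=4)
recovered = rebuild(stripes[0], lost_index=1)  # pretend disk 1 failed
```

Because XOR is its own inverse, the same routine rebuilds either a data block or the parity block.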
    • Network-based
      Both in-band and out-of-band approaches provide storage virtualization with the ability to:
      • Pool heterogeneous vendor storage products into a single, seamlessly accessible pool.
      • Perform replication between non-like devices.
      • Provide a single management interface.
      Only the in-band approach can cache data for increased performance.
      Both approaches also suffer from a number of drawbacks:
      • Implementation can be very complex because the pooling of storage requires the storage extents to be remapped into virtual extents.
      • The virtualization devices are typically servers running system software and requiring as much maintenance as a regular server.
      • I/O can suffer from added latency because each request takes multiple steps to complete, which hurts performance and scalability; throughput is also limited by the memory and CPU available in the appliance nodes.
      • Decoupling the virtualization from the storage once it has been implemented is impossible because all the metadata resides in the appliance, which makes the solution proprietary.
      • Solutions on the market exist only for Fibre Channel (FC) based SANs.
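The remapping of storage extents into virtual extents, named above as the source of implementation complexity, boils down to a lookup table from virtual block addresses to (device, physical block) pairs. The following is a minimal sketch under that reading; the `Extent` and `VirtualVolume` names and the device labels are hypothetical, and a real appliance would also persist this metadata and handle concurrent updates.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Extent:
    virtual_start: int   # first virtual block covered by this extent
    length: int          # number of blocks
    device: str          # hypothetical label of the backing array
    physical_start: int  # first physical block on that array


class VirtualVolume:
    """Concatenates physical extents into one virtual block address space."""

    def __init__(self):
        self._extents = []
        self._starts = []  # virtual_start of each extent, kept sorted
        self.size = 0

    def add_extent(self, device: str, physical_start: int, length: int):
        self._extents.append(Extent(self.size, length, device, physical_start))
        self._starts.append(self.size)
        self.size += length

    def resolve(self, virtual_block: int):
        """Translate a virtual block number into (device, physical block)."""
        if not 0 <= virtual_block < self.size:
            raise IndexError(f"virtual block {virtual_block} out of range")
        i = bisect.bisect_right(self._starts, virtual_block) - 1
        extent = self._extents[i]
        return extent.device, extent.physical_start + (virtual_block - extent.virtual_start)


volume = VirtualVolume()
volume.add_extent("arrayA", physical_start=1000, length=100)
volume.add_extent("arrayB", physical_start=0, length=50)
```

Every I/O has to pass through `resolve` (or its hardware equivalent), which is one of the extra steps behind the latency concern raised in the list.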
    • Controller-based
      • Connectivity to external storage assets is done via industry standard protocols, with no proprietary lock-in.
      • Complexity is reduced as it needs no additional hardware to extend the benefits of virtualization. In many cases the requirement for SAN hardware is greatly reduced.
      • Controller-based virtualization is typically cheaper than other approaches due to the ability to leverage existing SAN infrastructure, and the opportunity to consolidate management, replication, and availability tools.
      • Capabilities such as replication, partitioning, migration, and thin provisioning are extended to legacy storage arrays.
      • Heterogeneous data replication between non-like vendors or different storage classes reduces data protection costs.
      • Interoperability issues are reduced as the virtualized controller mimics a server connection to external storage.
      Although controller-based virtualization has a few downsides, its advantages far outweigh them and also address most of the deficiencies found in server- and network-based approaches.
    • Outline
      Storage Virtualization
      Object Storage
      Distributed File System
    • Motivation of Object Storage
      Improved device and data sharing
      - platform-dependent metadata moved to device
      Improved scalability & security
      - devices directly handle client requests
      - object security
      Improved performance
      - data types can be differentiated at the device
      Improved storage management
      - self-managed, policy-driven storage
      - storage devices become more autonomous
    • Objects in Storage
      The root object -- The OSD itself
      User object -- Created by SCSI commands from the application or client
      Collection object -- A group of user objects, such as all .mp3
      Partition object -- Containers that share common security and space management characteristics
      [Figure: OSD object hierarchy, with the labels Root Object (one per device), Partition Objects, Collection Objects, User Objects (for user data), and Object ID]
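The four object types on this slide can be modeled as a small containment hierarchy. The sketch below is illustrative only: the class names, the attribute dictionary, and the predicate-based `collect` helper are assumptions, and it does not reproduce the actual SCSI OSD command set.

```python
from dataclasses import dataclass, field


@dataclass
class UserObject:
    object_id: int
    data: bytes = b""
    attributes: dict = field(default_factory=dict)


@dataclass
class Partition:
    """Container whose objects share security and space-management settings."""
    partition_id: int
    user_objects: dict = field(default_factory=dict)  # object_id -> UserObject
    collections: dict = field(default_factory=dict)   # name -> set of object IDs

    def create(self, object_id: int, data: bytes, **attributes) -> UserObject:
        obj = UserObject(object_id, data, dict(attributes))
        self.user_objects[object_id] = obj
        return obj

    def collect(self, name: str, predicate) -> set:
        """Build a collection object from the user objects matching predicate."""
        members = {oid for oid, obj in self.user_objects.items() if predicate(obj)}
        self.collections[name] = members
        return members


@dataclass
class RootObject:
    """One per device: the OSD itself."""
    partitions: dict = field(default_factory=dict)    # partition_id -> Partition

    def create_partition(self, partition_id: int) -> Partition:
        return self.partitions.setdefault(partition_id, Partition(partition_id))


osd = RootObject()
part = osd.create_partition(1)
part.create(10, b"audio bytes", suffix=".mp3")
part.create(11, b"some notes", suffix=".txt")
music = part.collect("music", lambda obj: obj.attributes.get("suffix") == ".mp3")
```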
    • Object Storage Device
      Two changes
      - Object-based storage offloads the file system's storage component to the storage device
      - The device interface changes from blocks to objects
      [Figure: layer stacks of the two models, top to bottom]
      Traditional model: System call interface, File system user component, File system storage component, Block interface, Block I/O manager, Storage device
      OSD model: System call interface, File system user component, Object interface, File system storage component, Block I/O manager, Storage device
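The two changes can be made concrete by placing a block device and an object device side by side. This is a toy sketch, not a real driver: the 512-byte block size, the class names, and the whole-object `write` and `read` calls are assumptions. What it shows is that in the OSD model the block allocation table lives inside the device, so the host only names objects.

```python
BLOCK_SIZE = 512  # assumed block size for this sketch


class BlockDevice:
    """Traditional interface: the host addresses fixed-size blocks by number."""

    def __init__(self, n_blocks: int):
        self.blocks = [bytes(BLOCK_SIZE)] * n_blocks

    def write_block(self, lba: int, data: bytes):
        assert len(data) == BLOCK_SIZE
        self.blocks[lba] = data

    def read_block(self, lba: int) -> bytes:
        return self.blocks[lba]


class ObjectStorageDevice:
    """OSD interface: the device itself maps object IDs onto blocks."""

    def __init__(self, n_blocks: int):
        self._disk = BlockDevice(n_blocks)
        self._free = list(range(n_blocks))
        self._layout = {}  # object_id -> (list of block numbers, byte length)

    def delete(self, object_id: int):
        lbas, _ = self._layout.pop(object_id, ([], 0))
        self._free.extend(lbas)

    def write(self, object_id: int, data: bytes):
        needed = -(-len(data) // BLOCK_SIZE)  # ceiling division
        reusable = len(self._layout.get(object_id, ([], 0))[0])
        if needed > len(self._free) + reusable:
            raise OSError("device full")
        self.delete(object_id)  # whole-object overwrite frees the old blocks
        lbas = []
        for start in range(0, len(data), BLOCK_SIZE):
            lba = self._free.pop()
            chunk = data[start:start + BLOCK_SIZE].ljust(BLOCK_SIZE, b"\0")
            self._disk.write_block(lba, chunk)
            lbas.append(lba)
        self._layout[object_id] = (lbas, len(data))

    def read(self, object_id: int) -> bytes:
        lbas, length = self._layout[object_id]
        return b"".join(self._disk.read_block(lba) for lba in lbas)[:length]


device = ObjectStorageDevice(n_blocks=8)
device.write(1, b"a" * 700)   # spans two blocks; the host never sees which ones
```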
    • Object Storage Architecture
      Summary of OSD Key Benefits
      ■ Better data sharing – Using objects means less metadata to keep coherent, which makes it possible to share the data across different platforms.
      ■ Better security – Unlike blocks, objects can protect themselves and authorize each I/O.
      ■ More intelligence – Object attributes help the storage devices learn about their users, applications and workloads. This leads to a variety of improvements, such as better data management through caching. Active disks can be implemented on OSDs to implement database filters. An intelligent OSD can also continuously reorganize the data, manage its own backups and deal with failures.
    • Lustre
      Lustre (Linux + Cluster)
      - the first open-source system with object storage
      - a massively parallel distributed file system
      - consists of clients, an MDS and OSTs
      - used by fifteen of the top 30 supercomputers in the world
      A single metadata server (MDS) with a single metadata target (MDT) per Lustre file system, which stores namespace metadata such as filenames, directories, access permissions, and file layout.
      Client(s) that access and use the data; concurrent and coherent read and write access to the files is allowed
      One or more object storage servers (OSSes) that store file data on one or more object storage targets (OSTs)
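The "file layout" kept by the MDS tells a client how a file's bytes are spread over OST objects, so that data I/O can then go directly to the OSSes. A minimal sketch of that mapping follows, assuming simple round-robin (RAID-0 style) striping with a fixed stripe size and stripe count; the function name and the 1 MiB and 4-way values are illustrative choices, not taken from the slides.

```python
MIB = 1024 * 1024


def locate(offset: int, stripe_size: int, stripe_count: int):
    """Map a file byte offset to (stripe index, offset inside that OST object).

    Assumes round-robin striping: stripe 0 goes to the first OST object,
    stripe 1 to the second, and so on, wrapping after stripe_count stripes.
    """
    stripe_number, within_stripe = divmod(offset, stripe_size)
    ost_index = stripe_number % stripe_count
    object_offset = (stripe_number // stripe_count) * stripe_size + within_stripe
    return ost_index, object_offset


# A file striped over 4 OST objects in 1 MiB stripes (assumed values).
first = locate(0, stripe_size=MIB, stripe_count=4)
wrapped = locate(4 * MIB + 5, stripe_size=MIB, stripe_count=4)
```

With this layout, a large sequential read touches all four OSTs in turn, which is where the parallel bandwidth comes from.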
    • Ceph
      Ceph is a distributed file system, built on object storage devices, that provides excellent performance, reliability, and scalability
      The metadata cluster stores the cluster map and controls data placement; it also manages higher-level POSIX functions (such as open, close, and rename).
    • Panasas
      Panasas (Panasas, Inc.)
      - consists of OSDs, the Panasas File System, and an MDS
      - claims to be the world's fastest HPC storage system
    • Outline
      Storage Virtualization
      Object Storage
      Distributed File System
    • Distributed File System
      A distributed file system, or network file system, is any file system that allows multiple hosts to access shared files over a computer network (Wikipedia)
      The history
      - 1st generation (1980s): NFS, AFS
      - 2nd generation (1990~1995): Tiger Shark, Slice File System
      - 3rd generation (1995~2000): Global File System, General Parallel File System, DiFFS, CXFS, HighRoad
      - 4th generation (2000~now): Lustre, GFS, GlusterFS, HDFS
    • Google File System (GFS)
      GFS is a scalable distributed file system for large, distributed, data-intensive applications at Google
      Departures from traditional design choices
      - component failures are the norm rather than the exception
      - files are huge by traditional standards
      - files are mostly changed by appending new data rather than overwriting
      - the application and the file system API are co-designed
      GFS Interface
      - create, delete, open, close, read, write
      - snapshot & record append
      The master maintains all file system metadata, such as the namespace, access control information, the mapping from files to chunks, and the locations of chunks
      Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers
      Files are divided into fixed-size (64 MB) chunks, and each chunk is identified by an immutable and globally unique 64-bit chunk handle. Chunkservers store chunks on local disks as Linux files. In addition, each chunk is replicated on multiple chunkservers; by default there are three replicas.
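Because chunks have a fixed size, a client can turn a byte offset into a chunk index with integer division before it asks the master for the chunk handle and replica locations. The sketch below shows only that arithmetic; the function name and the tuple layout are assumptions for illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, as stated on the slide


def split_range(offset: int, length: int):
    """Split a byte range into per-chunk pieces.

    Returns a list of (chunk_index, offset_in_chunk, n_bytes) tuples, one per
    chunk touched by the range, in file order.
    """
    pieces = []
    end = offset + length
    while offset < end:
        index, within = divmod(offset, CHUNK_SIZE)
        n_bytes = min(CHUNK_SIZE - within, end - offset)
        pieces.append((index, within, n_bytes))
        offset += n_bytes
    return pieces


# A 30-byte read that starts 10 bytes before the end of the first chunk
# has to be served by two different chunks.
pieces = split_range(CHUNK_SIZE - 10, 30)
```

Each piece can then be sent to any chunkserver holding a replica of that chunk, which keeps the master out of the data path.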
    • Write Control and Data Flow
      1) The client asks the master which chunkserver holds the current lease for the chunk and where the other replicas are located. If no replica holds a lease, the master grants one to a replica it chooses.
      2) The master replies with the identity of the primary and the locations of the other replicas. The client caches this information.
      3) The client pushes the data to all replicas, in any order.
      4) Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The primary assigns consecutive serial numbers to all the mutations it receives and applies each mutation to its own local state in serial number order.
      5) The primary forwards the write request to the secondary replicas.
      6) The secondaries all reply to the primary, indicating that they have completed the operation.
      7) The primary replies to the client.
      Error cases:
      - The write failed at the primary: it would not have been assigned a serial number and forwarded.
      - The write succeeded at the primary and an arbitrary subset of the secondary replicas.
      The client code handles such errors by retrying the failed mutation.
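The separation of data flow from control flow in this protocol can be sketched in a few lines. The model below is heavily simplified and all names are hypothetical: it leaves out the master, leases, failures and retries, treats every mutation as an append, and forwards synchronously, so "serial number order" is simply the order in which the primary handles requests.

```python
class Replica:
    def __init__(self, name: str):
        self.name = name
        self.buffer = {}          # data_id -> bytes pushed but not yet written
        self.chunk = bytearray()  # the replica's copy of the chunk
        self.applied = []         # serial numbers, in the order applied

    def push(self, data_id: str, data: bytes):
        """Data flow: buffer the bytes; nothing is written to the chunk yet."""
        self.buffer[data_id] = data

    def apply(self, serial: int, data_id: str):
        """Control flow: apply a buffered mutation under the given serial."""
        self.chunk += self.buffer.pop(data_id)
        self.applied.append(serial)


class Primary(Replica):
    def __init__(self, name: str, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id: str) -> int:
        serial = self.next_serial  # the primary alone decides the order
        self.next_serial += 1
        self.apply(serial, data_id)
        for secondary in self.secondaries:  # forward the write request
            secondary.apply(serial, data_id)
        return serial                        # reply to the client


def client_write(primary: Primary, data_id: str, data: bytes) -> int:
    for replica in [primary, *primary.secondaries]:  # push data to all replicas
        replica.push(data_id, data)
    return primary.write(data_id)  # then send the write request to the primary


secondaries = [Replica("s1"), Replica("s2")]
primary = Primary("p", secondaries)
first_serial = client_write(primary, "a", b"foo")
second_serial = client_write(primary, "b", b"bar")
```

Since only the primary hands out serial numbers, all replicas end up applying the same mutations in the same order even when several clients write concurrently.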
    • Hadoop Distributed File System (HDFS)
      The Hadoop Distributed File System (HDFS) is an open-source file system modeled on the design of GFS
      NameNode: a master server that manages the file system namespace and regulates access to files by clients
      DataNodes: manage the storage attached to the nodes that they run on
      A file is split into one or more blocks, and these blocks are stored in a set of DataNodes
    • Taobao File System
      Taobao File System (TFS) is a distributed file system optimized for managing massive numbers of small files (typically no larger than 1 MB), such as product pictures and descriptions
      Application/Client: accesses the name server and data servers through TFSClient
      Name Server: stores metadata, monitors data servers through heartbeat messages, controls I/O balance, and keeps data location info such as <block id, data server>
      Data Server: stores application data and handles load balancing and redundant backup
    • GlusterFS
      GlusterFS is an open-source clustered file system capable of scaling to several petabytes and handling thousands of clients
      Fundamental shifts:
      • Elimination of metadata synchronization and updates: for each individual operation, Gluster calculates metadata using universal algorithms
      • Effective distribution of data: file distribution is handled intelligently using elastic hashing
      • Highly parallel architecture: there is a far more intelligent relationship between available CPUs and spindles
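The idea of calculating a file's location instead of looking it up can be sketched as follows. This is not Gluster's actual elastic hash: the SHA-1 hash, the equal-width hash ranges, and the brick names are assumptions chosen to keep the example short. The point is that any client running the same function reaches the same answer without consulting a metadata server.

```python
import hashlib

HASH_SPACE = 2 ** 32


def file_hash(path: str) -> int:
    """Deterministic 32-bit hash of a file path (SHA-1 is an arbitrary choice)."""
    return int.from_bytes(hashlib.sha1(path.encode()).digest()[:4], "big")


def brick_for(path: str, bricks):
    """Pick the brick whose hash range contains the file's hash.

    Each brick owns an equal slice of the 32-bit hash space; the last brick
    also absorbs the remainder when the space does not divide evenly.
    """
    width = HASH_SPACE // len(bricks)
    index = min(file_hash(path) // width, len(bricks) - 1)
    return bricks[index]


bricks = ["server1:/export", "server2:/export", "server3:/export"]  # hypothetical
location = brick_for("/photos/cat.jpg", bricks)
```

Two clients that share the brick list compute the same `location` independently, so there is no central metadata to keep synchronized.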
    • GlusterFS(cont.)
      Gluster offers multiple ways for users to access volumes in a Gluster storage cluster
      GlusterFS volumes can be configured for different scenarios:
      1) Distributed: distributes files throughout the cluster;
      2) Distributed Replicated: replicates data across two or more nodes in the cluster;
      3) Distributed Striped: stripes files across multiple nodes in the cluster.
    • Sheepdog
      Sheepdog is a distributed storage system for QEMU/KVM
      - Amazon EBS-like volume pool
      - highly scalable, available and reliable
      - support for advanced volume management
      - not a general file system; the API is designed specifically for QEMU
      Zero configuration of cluster nodes
      Automatically detects added nodes
      Automatically detects removed nodes
    • Sheepdog(cont.)
      Volumes are divided into 4 MB objects; each object is identified by a globally unique 64-bit ID and replicated to multiple nodes
      Consistent hashing is used to decide which nodes store each object. Each node is also placed on the ring. Adding or removing nodes does not significantly change the mapping of objects
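A minimal consistent-hashing ring illustrates why membership changes move so few objects. This sketch is not Sheepdog's code: the SHA-1 based hash, the single ring position per node (no virtual nodes), the node names, and the default of three copies are assumptions for illustration.

```python
import bisect
import hashlib


def hash64(key: str) -> int:
    """Position on the ring: first 64 bits of SHA-1 (an arbitrary choice)."""
    return int.from_bytes(hashlib.sha1(key.encode()).digest()[:8], "big")


class Ring:
    def __init__(self, nodes, copies: int = 3):
        self.copies = copies
        self._points = sorted((hash64(node), node) for node in nodes)
        self._keys = [point for point, _ in self._points]

    def nodes_for(self, object_id: int):
        """Walk clockwise from the object's position and take `copies` nodes."""
        if not self._points:
            return []
        start = bisect.bisect_right(self._keys, hash64(f"{object_id:016x}"))
        count = min(self.copies, len(self._points))
        return [self._points[(start + i) % len(self._points)][1] for i in range(count)]


nodes = ["node-a", "node-b", "node-c", "node-d", "node-e"]  # hypothetical names
before = Ring(nodes)
after = Ring(nodes + ["node-x"])  # one node joins the cluster

moved = [oid for oid in range(1000) if before.nodes_for(oid) != after.nodes_for(oid)]
```

Only objects whose replica set now includes `node-x` appear in `moved`; every other object keeps exactly the same nodes, which is the property the slide describes.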