
LISA 2015: GlusterFS Introduction

  1. GlusterFS – A Scale-out Software Defined Storage. Rajesh Joseph, Poornima Gurusiddaiah. November 8–13, 2015 | Washington, D.C. | www.usenix.org/lisa15 | #lisa15
  2. Agenda
     • Introduction
     • GlusterFS Concepts
     • Architecture
     • Additional Features
     • Installation, Setup and Configuration
     • Future Work
     • Q&A
  3. GlusterFS
     • Open-source, general-purpose, scale-out distributed file system
     • Aggregates storage exports over a network interconnect to provide a single unified namespace
     • Layered on disk file systems that support extended attributes
     • No metadata server
     • Modular architecture for scale and functionality
     • Heterogeneous commodity hardware
     • Scalable to petabytes and beyond
  4. GlusterFS Concepts
  5. Node
     • A node is a server capable of hosting GlusterFS bricks
     • Server: Intel/AMD x86 64-bit processor; memory: 1 GB minimum; disk: 8 GB minimum, using direct-attached storage, RAID, Amazon EBS, or FC/InfiniBand/iSCSI SAN disk backends with SATA/SAS/FC disks
     • Logical Volume Manager: LVM2 with thin provisioning
     • Networking: Gigabit Ethernet, 10 Gigabit Ethernet, or InfiniBand (OFED 1.5.2 or later)
     • Filesystem: POSIX with extended attributes (EXT4, XFS, BTRFS, ...)
  6. Trusted Storage Pool
     • A collection of storage servers (nodes), also known as the cluster
     • A Trusted Storage Pool is formed by invitation – "probe"
     • Members can be dynamically added to and removed from the pool
     • Only nodes in a Trusted Storage Pool can participate in volume creation
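A minimal sketch of forming a pool with the gluster CLI; the hostnames server1–server3 are illustrative and glusterd is assumed to be already running on every node (commands run from server1):

```sh
# Invite the other nodes into the trusted storage pool
gluster peer probe server2
gluster peer probe server3

# Verify membership
gluster peer status
gluster pool list
```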
  7. Bricks
     • A brick is a unit of storage used as a capacity building block
     • A brick is the combination of a node and an export directory – e.g. hostname:/dir
     • Layered on a POSIX-compliant file system (e.g. XFS, ext4); each brick inherits the limits of the underlying filesystem
     • It is recommended to use an independent thinly provisioned LVM volume as a brick; thin provisioning is needed by the Snapshot feature
  8. Diagram: bricks as thinly provisioned logical volumes – each brick is a mounted thin LV carved out of a thin pool inside a volume group on the storage devices.
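A hedged sketch of preparing such a brick, assuming a hypothetical spare disk /dev/sdb and the mount point /data/brick1; names and sizes are purely illustrative:

```sh
# Volume group and thin pool on the disk
pvcreate /dev/sdb
vgcreate gluster_vg /dev/sdb
lvcreate -L 100G -T gluster_vg/brick_pool

# Thin LV carved from the pool, formatted with XFS (a common choice for bricks)
lvcreate -V 100G -T gluster_vg/brick_pool -n brick1_lv
mkfs.xfs -i size=512 /dev/gluster_vg/brick1_lv

# Mount it; the brick export directory lives inside this mount
mkdir -p /data/brick1
mount /dev/gluster_vg/brick1_lv /data/brick1
```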
  9. GlusterFS Volume
     • A volume is a logical collection of one or more bricks
     • The nodes hosting these bricks must be part of a single Trusted Storage Pool
     • One or more volumes can be hosted on the same node
  10. Diagram: a GlusterFS volume composed of bricks hosted on three storage nodes.
  11. Diagram: two volumes (Volume1 and Volume2) whose bricks are spread across the same three storage nodes.
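A minimal sketch of creating and starting a plain volume from bricks on the pool formed above; the volume name and brick paths are illustrative:

```sh
# A brick is host:/export-directory; one brick per node here
gluster volume create demo_vol \
    server1:/data/brick1/demo server2:/data/brick1/demo server3:/data/brick1/demo

# Start the volume and inspect it
gluster volume start demo_vol
gluster volume info demo_vol
gluster volume status demo_vol
```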
  12. Translators
     • The building blocks of a GlusterFS process
     • Translators can be stacked together to achieve the desired functionality
     • Translators are deployment agnostic – they can be loaded in either the client or the server stack
  13. Diagram: a translator stack – client-side translators (VFS, I/O cache, read-ahead, distribute, replicate) talk over GigE/10GigE (TCP/IP) or InfiniBand (RDMA) to server-side translators (server, POSIX) that sit on top of the bricks.
  14. GlusterFS Processes
     • glusterd – management daemon; one instance on each GlusterFS node
     • glusterfsd – GlusterFS brick daemon; one process per brick on each node, managed by glusterd
     • gluster – Gluster console manager (CLI)
     • glusterfs – node services; one process for each service (NFS server, self-heal, quota, ...)
     • mount.glusterfs – FUSE native client mount extension
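A short, hedged sketch of inspecting these processes on a node (the service name assumes a systemd-based distribution):

```sh
# Management daemon
systemctl status glusterd

# Brick daemons and per-service daemons for a volume
gluster volume status demo_vol

# Raw process view
ps -ef | grep -E 'glusterd|glusterfsd|glusterfs'
```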
  15. Accessing a Volume
     • A volume can be mounted on the local file system; the following access protocols are supported:
     • GlusterFS Native client – Filesystem in Userspace (FUSE)
     • libgfapi – flexible abstracted storage, used by Samba, NFS-Ganesha and QEMU
     • NFS – NFS-Ganesha or Gluster NFS (NFSv3)
     • SMB / CIFS – Windows and Linux clients
     • Gluster For OpenStack – object-based access via OpenStack Swift
  16. Diagram: access paths to a volume over the network interconnect – FUSE native client, NFS, Samba, and applications (QEMU, NFS-Ganesha, other apps) via libgfapi.
  17. GlusterFS Native Client (FUSE)
     • The FUSE kernel module allows the file system to be built and operated entirely in userspace
     • The mount can point at any GlusterFS server
     • The native client fetches the volfile from the mount server, then communicates directly with all nodes to access data
  18. Diagram: accessing a volume with the FUSE client – the client mounts the volume and talks directly to the bricks on all storage nodes.
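A minimal native-client mount sketch, reusing the illustrative volume and hostnames from above:

```sh
# server1 only supplies the volfile; afterwards the client talks to every brick directly
mkdir -p /mnt/demo
mount -t glusterfs server1:/demo_vol /mnt/demo

# Alternatively, list fallback volfile servers in case server1 is down
mount -t glusterfs -o backup-volfile-servers=server2:server3 \
    server1:/demo_vol /mnt/demo
```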
  19. libgfapi
     • Userspace library for accessing GlusterFS volumes
     • File-system-like API
     • No FUSE, no copies, no context switches
     • Samba, QEMU and NFS-Ganesha are integrated with libgfapi
  20. Diagram (libgfapi advantage, FUSE path): the application on the client goes through the kernel VFS, /dev/fuse and the GlusterFS FUSE library before reaching glusterfsd and the XFS brick on the storage node.
  21. Diagram (libgfapi advantage, libgfapi path): the application links against libgfapi in userspace and talks to glusterfsd on the storage node directly, skipping the kernel FUSE round trip.
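A hedged sketch of one libgfapi consumer: QEMU can address an image on a Gluster volume with a gluster:// URL and bypass FUSE entirely. The volume and image names are illustrative, and option details vary by QEMU version:

```sh
# Create a VM disk image directly on the volume via libgfapi
qemu-img create -f qcow2 gluster://server1/demo_vol/vm1.qcow2 20G

# Boot a guest from that image (illustrative invocation)
qemu-system-x86_64 -m 2048 \
    -drive file=gluster://server1/demo_vol/vm1.qcow2,if=virtio
```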
  22. Gluster NFS
     • Supports the NFSv3 protocol
     • Mount from any server, or use a load balancer; load balancing must be managed externally
     • The GlusterFS NFS server uses the Network Lock Manager (NLM) to synchronize locks across clients
  23. Diagram: accessing a volume over Gluster NFS – the client mounts through an IP load balancer that fronts the NFS servers running on each storage node.
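A minimal NFSv3 mount sketch against the built-in Gluster NFS server, using the illustrative names from earlier:

```sh
# Gluster NFS speaks NFSv3 over TCP
mkdir -p /mnt/demo-nfs
mount -t nfs -o vers=3,proto=tcp server1:/demo_vol /mnt/demo-nfs
```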
  24. SMB / CIFS
     • Samba VFS plugin for GlusterFS; uses libgfapi
     • Must be set up on each server you wish to connect to
     • CTDB is required for Samba clustering
  25. Diagram: accessing a volume over SMB – the client connects through an IP load balancer to the smbd instances running on the storage nodes.
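A hedged sketch of exporting the volume through Samba's glusterfs VFS module; the share name and settings are illustrative (recent Gluster packages typically generate a similar share via hook scripts), and the Samba service name differs between distributions:

```sh
# Share definition using the glusterfs VFS plugin (libgfapi)
cat >> /etc/samba/smb.conf <<'EOF'
[gluster-demo_vol]
    vfs objects = glusterfs
    glusterfs:volume = demo_vol
    glusterfs:volfile_server = localhost
    path = /
    read only = no
EOF

systemctl restart smb    # service is named 'smbd' on Debian/Ubuntu
```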
  26. Volume Types
     • The type of a volume is specified at volume-creation time and determines how and where data is placed
     • Basic volume types: Distributed, Replicated, Dispersed, Striped, Sharded
     • Advanced volume types: Distributed Replicated, Distributed Dispersed, Distributed Striped, Distributed Striped Replicated
  27. Distributed Volume
     • Distributes whole files across the bricks of the volume – similar to file-level RAID-0
     • Scaling but no high availability
     • Uses the Davies-Meyer hash algorithm: a 32-bit hash space is divided into N ranges for N bricks
     • Removes the need for an external metadata server
  28. Diagram: distribution by hash – the 32-bit hash space is split into ranges [0, a], [a + 1, b] and [b + 1, c] across three bricks; File1 hashes to x with 0 <= x <= a and lands on the first brick, File2 hashes to y with b < y <= c and lands on the third.
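A minimal sketch of creating a pure distributed volume (the default when no replica or disperse count is given); names are illustrative:

```sh
# Three bricks, no replication: files are hashed across the bricks
gluster volume create dist_vol \
    server1:/data/brick1/dist server2:/data/brick1/dist server3:/data/brick1/dist
gluster volume start dist_vol
```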
  29. Replicate Volume
     • Synchronous replication of all directory and file updates – similar to file-level RAID-1
     • High availability but no scaling
     • Replica pairs are decided at volume-creation time
     • It is recommended to host each brick of a replica set on a different node
  30. Diagram: a replicated volume – the client's write of File1 goes to both bricks of the replica pair, one on each storage node.
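A minimal sketch of a two-way replicated volume matching the diagram; note that later releases usually recommend a third replica or an arbiter to reduce split-brain risk:

```sh
# Two bricks, replica 2: every file exists on both nodes
gluster volume create repl_vol replica 2 \
    server1:/data/brick1/repl server2:/data/brick1/repl
gluster volume start repl_vol
```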
  31. Disperse Volume
     • Erasure-coded volume: high availability with a smaller number of bricks
     • Stores the data of m disks on k disks (k > m), with n = k - m redundant disks
     • The most-tested configurations are:
       6 bricks with redundancy level 2 (4 + 2)
       11 bricks with redundancy level 3 (8 + 3)
       12 bricks with redundancy level 4 (8 + 4)
  32. Diagram: a dispersed volume – File1 is split into data parts plus a parity/redundancy part, spread across the bricks on three storage nodes.
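A minimal sketch of the most-tested 4 + 2 configuration, assuming six hypothetical hosts server1–server6:

```sh
# 6 bricks, 2 of them redundancy: the volume survives the loss of any 2 bricks
gluster volume create disp_vol disperse 6 redundancy 2 \
    server1:/data/brick1/disp server2:/data/brick1/disp server3:/data/brick1/disp \
    server4:/data/brick1/disp server5:/data/brick1/disp server6:/data/brick1/disp
gluster volume start disp_vol
```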
  33. Striped Volume
     • Individual files are split among bricks (as sparse files) – similar to block-level RAID-0
     • Recommended only when very large files, greater than the size of the disks, are present
     • Chunks are files with holes – this helps in maintaining offset consistency
  34. Diagram: a striped volume – chunks of File1 are spread across the bricks on three storage nodes.
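A minimal sketch of a striped volume as created in this era of GlusterFS (striping was later deprecated in favour of sharding); names are illustrative:

```sh
# Three bricks, stripe 3: each file is chunked across all three bricks
gluster volume create stripe_vol stripe 3 \
    server1:/data/brick1/stripe server2:/data/brick1/stripe server3:/data/brick1/stripe
gluster volume start stripe_vol
```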
  35. Sharded (Stripe 2.0) Volume
     • Individual files are split among bricks; each part is a separate file on the backend
     • The file name of each part is derived from the GFID (UUID) of the actual file, so unlike Stripe, renaming the file has no impact on its parts
     • Unlike other volume types, sharding is a volume option that can be set on any volume
     • Recommended only when very large files, greater than the size of the disks, are present
  36. Diagram: a sharded volume – File1 is stored as shards GFID1.1, GFID1.2 and GFID1.3 on bricks across three storage nodes.
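A minimal sketch of enabling sharding on an existing volume; the block size shown is just an example value:

```sh
# Sharding is a volume option, not a separate volume type
gluster volume set demo_vol features.shard on
gluster volume set demo_vol features.shard-block-size 64MB
```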
  37. Distribute Replicate Volume
     • Distributes files across replicated brick sets
     • The number of bricks must be a multiple of the replica count, and the ordering of bricks in the volume definition matters
     • Scaling and high availability
     • The most preferred model of deployment
  38. Diagram: a distributed replicated volume – four bricks on four storage nodes form two replica pairs; File1 is distributed to one pair and File2 to the other.
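A minimal sketch of a 2 x 2 distributed replicated volume. Brick order matters: consecutive bricks form a replica pair, so the pairs below are (server1, server2) and (server3, server4); all names are illustrative:

```sh
# 4 bricks, replica 2: two replica pairs, files distributed across the pairs
gluster volume create distrep_vol replica 2 \
    server1:/data/brick1/dr server2:/data/brick1/dr \
    server3:/data/brick1/dr server4:/data/brick1/dr
gluster volume start distrep_vol
```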
  39. Performance
     • A different SLA than a local file system; performance is distributed unevenly between nodes and clients
     • Various performance options are available (e.g. read-ahead, md-cache); tune them based on the workload
     • Don't forget to attend the detailed session on performance by Jeff Darcy
       Topic : Evaluating Distributed File System Performance (GlusterFS and Ceph)
       When  : Friday, November 13, 2015, 9:00am–10:30am
       Detail: https://www.usenix.org/conference/lisa15/conference-program/presentation/darcy
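A hedged sketch of workload-based tuning with volume options; the option names below exist, but the values are illustrative and the right settings depend entirely on the workload:

```sh
# Per-volume translator toggles and cache sizes
gluster volume set demo_vol performance.read-ahead on
gluster volume set demo_vol performance.cache-size 256MB
gluster volume set demo_vol performance.stat-prefetch on   # md-cache

# 'volume info' lists the options reconfigured from their defaults
gluster volume info demo_vol
```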
  40. Additional Features
  41. Additional Features
     • Geo-Replication
     • Volume Snapshot
     • Quota
     • Bit-rot
     • Compression
     • Encryption at rest
     • Trash / Recycle-bin
     • pNFS with NFS-Ganesha
     • Data Tiering
     • User Serviceable Snapshots
     • Server-side Quorum
     • Client-side Quorum
     • And many more ...
  42. Geo-Replication
     • Mirrors data across geographically distributed trusted storage pools; provides backups of data for disaster recovery
     • Master-Slave model with asynchronous replication
     • Provides an incremental, continuous replication service over Local Area Networks (LANs), Wide Area Networks (WANs) or the Internet
  43. Diagram: geo-replication – a master volume in one trusted storage pool replicating to a slave volume in another pool over a LAN/WAN/Internet link.
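A hedged sketch of setting up a geo-replication session from the master side. It assumes a hypothetical slave host and volume (slavehost, slave_vol), passwordless SSH to the slave, and that the exact steps vary between GlusterFS releases:

```sh
# One-time: distribute the SSH keys used by geo-replication sessions
gluster system:: execute gsec_create

# Create and start the master -> slave session
gluster volume geo-replication demo_vol slavehost::slave_vol create push-pem
gluster volume geo-replication demo_vol slavehost::slave_vol start

# Monitor it
gluster volume geo-replication demo_vol slavehost::slave_vol status
```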
  44. GlusterFS Snapshot
     • Volume-level, crash-consistent snapshots
     • LVM2 based: operates only on thinly provisioned volumes, and the snapshots themselves are thinly provisioned snapshot volumes
     • A GlusterFS volume snapshot consists of snapshots of the individual bricks making up the volume
     • A snapshot of a GlusterFS volume is itself a GlusterFS volume
  45. Diagram: thinly provisioned LVM2 volume and snapshot – the thin LV and its snapshot both draw space from the same thin pool inside the volume group on the storage devices.
  46. Diagram: a Gluster volume snapshot – I/O is briefly barriered while per-brick snapshots (Brick1_s1 ... Brick4_s1) are taken on each storage node and assembled into the snapshot volume.
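A minimal sketch of the snapshot workflow with the gluster CLI, using the illustrative volume name from earlier (the created snapshot's full name may carry a timestamp suffix):

```sh
# Take a crash-consistent snapshot of the whole volume
gluster snapshot create snap1 demo_vol

# List, activate (so it can be mounted/browsed) and restore
gluster snapshot list demo_vol
gluster snapshot activate snap1
gluster snapshot restore snap1    # the volume must be stopped before a restore
```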
  47. Data Tiering
     • Classification and placement of data, currently based on access rate
     • A tiered volume is created by attaching a tier to an existing volume: the bricks of the original volume become the "cold" bricks and the bricks of the attached tier become the "hot" bricks
     • Hot bricks are normally hosted on faster disks, e.g. SSDs
     • A tier can be attached or detached using volume commands
  48. Diagram: a tiered volume – a hot tier of bricks sits in front of a cold tier of bricks spread across the storage nodes, with the client accessing the volume as a whole.
  49. Diagram: translator view of a tiered volume – the tier translator sits above a replicated hot tier and a distributed replicated cold tier.
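A hedged sketch of attaching and detaching a hot tier; this is roughly the GlusterFS 3.7-era command form (it changed in later releases), and the SSD-backed brick paths are illustrative:

```sh
# Attach a replicated pair of SSD-backed bricks as the hot tier
gluster volume attach-tier demo_vol replica 2 \
    server1:/ssd/brick1/hot server2:/ssd/brick1/hot

# Later: drain the hot tier, then detach it
gluster volume detach-tier demo_vol start
gluster volume detach-tier demo_vol commit
```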
  50. Future Work
     • New Style Replication (NSR)
     • Scalability improvements (1000+ nodes)
     • Multi-protocol support
     • Multi-master Geo-Replication
     • Caching improvements
     • Btrfs support
     • And many more ...
