GlusterFS Update and OpenStack Integration


  1. 1. GlusterFS Update and OpenStack Integration v1.0 2014/05/14 Etsuji Nakai Senior Solution Architect and Cloud Evangelist Red Hat K.K.
  2. 2. 2 Contents  Recap: What is GlusterFS?  Recap: DHT architecture overview  The current status of OpenStack integration  libgfapi: Mini Tutorial
  3. 3. Recap: What is GlusterFS?
  4. 4. 4 What is GlusterFS?  GlusterFS is open source software to create a scale-out distributed filesystem on top of commodity x86_64 servers. – It aggregates the local storage of many servers into a single logical volume. – You can extend the volume just by adding more servers.  GlusterFS runs on top of Linux. You can use it wherever you can use Linux. – Physical/virtual machines in your data center. – Linux VMs on public clouds.  GlusterFS provides a wide variety of APIs. – FUSE mount (using the native client.) – NFSv3 (supporting distributed NFS locking.) – CIFS (using libgfapi native access from Samba.) – REST API (compatible with OpenStack Swift.) – Native application library (providing POSIX-like system calls.)
  5. 5. 5 Brief history of GlusterFS 2005 2011 2012 2013 2014 GlusterFS 3.3 GlusterFS 3.4 GlusterFS 3.5 http://www.slideshare.net/johnmarkorg/gluster-where-weve-been-a-history Red Hat acquisition of Gluster Inc. The early days of Gluster Inc.
  6. 6. 6 Architecture overview  The standard filesystem (typically XFS) of each storage node is used as the backend device of the logical volume. – Each file in the volume is physically stored in one of the storage nodes' filesystems, as the same plain file seen from the client.  The hash value of the file name decides which node stores the file. – GlusterFS does not use a metadata server to store file locations. (diagram: files file01, file02, file03 are distributed across the local filesystems of the storage nodes; the GlusterFS client sees the volume as a single filesystem mounted on a local directory tree.)
  7. 7. 7 Hierarchy of Node / Brick / Volume  A volume is created as a "bundle" of bricks which are provided by storage nodes. A brick is just a directory on a local filesystem (e.g. /data/brick01 and /data/brick02 on a filesystem mounted on /data).  A single node can provide multiple bricks to create multiple volumes.  You don't need to use the same number of bricks nor the same directory name on each node.  You can add/remove bricks to extend/reduce the size of volumes. (diagram: volume vol01 bundles bricks from Node01, Node02 and Node03.)
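As a brick-preparation sketch under assumed names (the device /dev/sdb1, mount point /data and brick directories are illustrative, not taken from the slides), each storage node might be set up like this:
  # mkfs.xfs -i size=512 /dev/sdb1        # XFS backend filesystem for the bricks
  # mkdir -p /data
  # mount /dev/sdb1 /data
  # mkdir -p /data/brick01 /data/brick02  # bricks are just directories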
  8. 8. 8 Volume configuration examples  Distribution: files are distributed across multiple bricks (each file is stored in one of the bricks).  Replication: a file is replicated between the specified brick pairs.  Striping: a file is split into fixed-size chunks, and the chunks are distributed to bricks.  Striping and replication can also be combined. (diagram: bricks /brick01-/brick04 on storage nodes node01-node04; a command-line sketch follows below.)
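For illustration only (the node names, brick paths and volume names below are assumptions, not from the slides), such volumes could be created roughly like this with the gluster CLI:
  # gluster peer probe node02             # repeat for node03 and node04
  # gluster volume create dist01 node01:/data/brick01 node02:/data/brick01 node03:/data/brick01 node04:/data/brick01
  # gluster volume create repl01 replica 2 node01:/data/brick02 node02:/data/brick02 node03:/data/brick02 node04:/data/brick02
  # gluster volume create strp01 stripe 2 node01:/data/brick03 node02:/data/brick03
  # gluster volume start dist01           # likewise for repl01 and strp01
With "replica 2" over four bricks the result is a distributed-replicated volume; "stripe 2" splits each file into chunks across the two bricks.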
  9. 9. Recap: DHT architecture
  10. 10. 10 DHT: Distributed Hash Table  A distributed hash table is: – A rule for deciding which brick stores a file, based on the hash value of its filename. – More precisely, it's just a table of bricks and their corresponding hash ranges. (example: file01's filename hashes to 127, so it is stored in Brick2, which is responsible for hash range 100-199; Brick1 covers 0-99 and Brick3 covers 200-299. The actual hash length is 32 bits: 0x00000000-0xFFFFFFFF.)
  11. 11. 11 DHT structure in GlusterFS  Hash tables are created for each directory in a single volume. – Two files with the same name (in different directories) are placed in different bricks. – By assigning different hash ranges for different directories, files are more evenly distributed.  The hash range of each brick (directory) is recorded in an extended attribute of the directory.
  [root@gluster01 ~]# getfattr -d -m . /data/brick01/dir01
  getfattr: Removing leading '/' from absolute path names
  # file: data/brick01/dir01
  trusted.gfid=0shk2IwdFdT0yI1K7xXGNSdA==
  trusted.glusterfs.10d3504b-7111-467d-8d4f-d25f0b504df6.xtime=0sT+vTRwADqyI=
  trusted.glusterfs.dht=0sAAAAAQAAAAB//////////w==
  (example table: /dir01 - Brick1 0-99, Brick2 100-199, Brick3 200-299; /dir02 - Brick1 100-199, Brick2 400-499, Brick3 300-399; /dir03 - Brick1 500-599, Brick2 200-299, Brick3 100-199.)
  12. 12. 12 How GlusterFS client recognizes the hash table # mount -t glusterfs gluster01:/vol01 /vol01  Volume "vol01" is provided by gluster01-gluster04.
  13. 13. 13 How GlusterFS client recognizes the hash table # cat /vol01/dir01/file01  The client queries gluster01-gluster04, and each node answers with the hash range its brick covers for dir01 ("the hash range of dir01 is xxx", "... is yyy", and so on).
  14. 14. 14 How GlusterFS client recognizes the hash table # cat /vol01/dir01/file01  From those answers the client constructs the whole hash table for dir01 in memory: Brick1 0-99, Brick2 100-199, Brick3 200-299, ...
  15. 15. 15 Translator modules  GlusterFS works with multiple translator modules. – There are modules running on clients and modules running on servers.  Each module has its own role. – Translator modules are built as shared libraries. – Your own modules can be added as plug-ins.
  [root@gluster01 ~]# ls -l /usr/lib64/glusterfs/3.3.0/xlator/
  total 48
  drwxr-xr-x 2 root root 4096 Jun 16 15:25 cluster
  drwxr-xr-x 2 root root 4096 Jun 16 15:25 debug
  drwxr-xr-x 2 root root 4096 Jun 16 15:25 encryption
  drwxr-xr-x 2 root root 4096 Jun 16 15:25 features
  drwxr-xr-x 2 root root 4096 Jun 16 15:25 mgmt
  drwxr-xr-x 2 root root 4096 Jun 16 15:25 mount
  drwxr-xr-x 2 root root 4096 Jun 16 15:25 nfs
  drwxr-xr-x 2 root root 4096 Jun 16 15:25 performance
  drwxr-xr-x 2 root root 4096 Jun 16 15:25 protocol
  drwxr-xr-x 2 root root 4096 Jun 16 15:25 storage
  drwxr-xr-x 2 root root 4096 Jun 16 15:25 system
  drwxr-xr-x 3 root root 4096 Jun 16 15:25 testing
  (cluster: DHT, replication, etc.; features: quota, file lock, etc.; performance: caching, read-ahead, etc.; storage: physical I/O.)
  16. 16. 16 Typical combination of translator modules  Client modules(*1): io-stats (recording statistics information), md-cache (metadata caching), quick-read / io-cache / read-ahead / write-behind (data caching), dht (handling DHT), replicate-1 / replicate-2 (replication), client-1 ... client-4 (communication with servers).  Server modules(*2), stacked on top of each brick: server (communication with clients), marker, index, io-threads (activating I/O threads), locks (file locking), access-control (ACL management), posix (physical access to the brick). (*1) Defined in /var/lib/glusterd/vols/<Vol>/<Vol>-fuse.vol (*2) Defined in /var/lib/glusterd/vols/<Vol>/<Vol>.<Node>.<Brick>.vol
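As a rough sketch of what such a volfile looks like (the volume name, host and brick path are illustrative, and the exact stack depends on the volume options), the client-side graph is defined as a chain of translator sections:
  # cat /var/lib/glusterd/vols/vol01/vol01-fuse.vol
  volume vol01-client-0
      type protocol/client
      option remote-host gluster01
      option remote-subvolume /data/brick01
  end-volume
  volume vol01-dht
      type cluster/distribute
      subvolumes vol01-client-0 vol01-client-1
  end-volume
Further sections stack write-behind, read-ahead, io-cache, quick-read, md-cache and io-stats on top of the dht translator.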
  17. 17. 17 The past wish list for GlusterFS  Volume Snapshot (master branch)  File Snapshot (GlusterFS3.5)  On-wire compression / decompression (GlusterFS3.4)  Disk Encryption (GlusterFS3.4)  Journal based distributed GeoReplication (GlusterFS3.5)  Erasure coding (Not yet...)  Integration with OpenStack  etc... http://www.gluster.org/
  18. 18. The current status of OpenStack Integration
  19. 19. 19 Four locations where you need a storage system in OpenStack  Swift (object store): typically a purpose-built distributed object store on commodity x86_64 servers.  Cinder (application data): typically external hardware storage (iSCSI).  Nova Compute (OS disks): typically local storage of the compute nodes.  Glance (template images): typically Swift or NFS storage.
  20. 20. Using GlusterFS for Glance backend  Just use a GlusterFS volume instead of local storage. So simple.  This is actually being used in many production clusters. (diagram: the Glance server mounts a GlusterFS volume; GlusterFS manages scalability, redundancy and consistency.)
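A minimal setup sketch (the server and volume names are illustrative; /var/lib/glance/images is the default filesystem store path and may differ in your deployment):
  # mount -t glusterfs gluster01:/glancevol /var/lib/glance/images
  # echo "gluster01:/glancevol /var/lib/glance/images glusterfs defaults,_netdev 0 0" >> /etc/fstab
  # chown -R glance:glance /var/lib/glance/images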
  21. 21. 21 How Nova and Cinder work together  In a typical configuration, block volumes are created as LUNs in iSCSI storage boxes. Cinder operates on the management interface of the storage through the corresponding driver.  Nova Compute attaches the LUN to the host Linux using the software iSCSI initiator, and it is then attached to the VM instance through the KVM hypervisor. (diagram: the storage box's iSCSI target exposes the LUN to the compute node as /dev/sdX, and KVM presents it to the VM instance as a virtual disk such as /dev/vdb.)
  22. 22. 22 Using NFS driver  Cinder also provides an NFS driver which uses an NFS server as the storage backend. – The driver simply mounts the NFS exported directory and creates disk image files in it. Compute nodes use an NFS mount to access the image files. (diagram: both the Cinder node and the Nova Compute nodes NFS-mount the NFS server; KVM attaches the image file to the VM instance as /dev/vdb.)
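As a configuration sketch (the export and file names are illustrative; the option names follow the Havana-era NFS driver and may differ in other releases):
  # cat /etc/cinder/nfs_shares
  nfsserver01:/export/cinder
  # grep -E '^(volume_driver|nfs_)' /etc/cinder/cinder.conf
  volume_driver = cinder.volume.drivers.nfs.NfsDriver
  nfs_shares_config = /etc/cinder/nfs_shares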
  23. 23. 23 Using GlusterFS driver for Cinder  There is a driver for the GlusterFS distributed filesystem, too. – Currently it uses the FUSE mount mechanism. This will be replaced with a more optimized mechanism (libgfapi) which bypasses the FUSE layer. (diagram: both the Cinder node and the Nova Compute nodes FUSE-mount the GlusterFS cluster; KVM attaches the image file to the VM instance as /dev/vdb.)
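The GlusterFS driver is configured in the same way; a sketch with an illustrative volume name (option names as in the Havana-era driver):
  # cat /etc/cinder/glusterfs_shares
  gluster01:/cindervol
  # grep -E '^(volume_driver|glusterfs_)' /etc/cinder/cinder.conf
  volume_driver = cinder.volume.drivers.glusterfs.GlusterfsDriver
  glusterfs_shares_config = /etc/cinder/glusterfs_shares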
  24. 24. 24 GlusterFS shared volume for Nova Compute  The same approach works for Nova Compute: you can store a running VM's OS image on a locally mounted GlusterFS volume. (diagram: Nova Compute nodes FUSE-mount the GlusterFS cluster, which holds the template image and the running VM's virtual disk, seen as /dev/vda inside the instance.)
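A minimal sketch (the volume name is an assumption; /var/lib/nova/instances is the default instance path, and sharing it across compute nodes is also what enables live migration):
  # mount -t glusterfs gluster01:/novavol /var/lib/nova/instances
  # chown -R nova:nova /var/lib/nova/instances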
  25. 25. 25 The challenge in Cinder/Nova Compute integration  The FUSE mount / file based architecture is not well suited to the workload of VM disk images (small random I/O).  How can we improve it?
  26. 26. 26 The challenge in Cinder/Nova Compute integration  The FUSE mount / file based architecture is not well suited to the workload of VM disk images (small random I/O).  How can we improve it? Using Ceph? CENSORED (http://www.inktank.com/)
  27. 27. 27 GlusterFS way for qemu integration  "libgfapi" is an application library with which user applications can directly access a GlusterFS volume via the native protocol. – It reduces the overhead of the FUSE architecture.  Now qemu is integrated with libgfapi so that it can directly access disk image files placed in a GlusterFS volume. – This feature is available since the Havana release. (diagram: the FUSE mount path vs. the direct libgfapi path.)
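Outside of OpenStack, the same integration can be seen directly from qemu's gluster:// URLs; a sketch with illustrative host, volume and image names:
  # qemu-img create -f qcow2 gluster://gluster01/testvol01/vm01.img 10G
  # qemu-system-x86_64 -m 1024 -drive file=gluster://gluster01/testvol01/vm01.img,if=virtio
qemu then talks to the GlusterFS servers through libgfapi, with no FUSE mount on the host.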
  28. 28. Architecture of Swift  Proxy Servers: handle REST requests from clients, working with an Authentication Server.  Account Servers: maintain mappings between accounts and containers (DB).  Container Servers: maintain lists and ACLs of objects in each container (DB).  Object Servers: store object contents in the file system.
  29. 29. Architecture of GlusterFS with Swift API  The Proxy / Account / Container / Object servers run "all in one" on a node that is also a GlusterFS client, together with an Authentication Server.  One GlusterFS volume is used for one account; the volume for each account is locally mounted at /mnt/gluster-object/AUTH_<account name>.  The Account/Container/Object server modules retrieve the required information directly from the locally mounted volumes.  GlusterFS manages scalability, redundancy and consistency.
  30. 30. libgfapi: Mini Tutorial
  31. 31. Using libgfapi with RHEL6/CentOS6  Install the development tools and the libgfapi library from the EPEL repository.  Build your application with libgfapi.  That's all!  Pseudo-POSIX I/O system calls are listed in the header file. – https://github.com/gluster/glusterfs/blob/release-3.5/api/src/glfs.h – file streams and mmap are not there :-(
  # yum install http://download.fedoraproject.org/pub/epel/6/i386/epel-release-6-8.noarch.rpm
  # yum groupinstall "Development Tools"
  # yum install glusterfs-api-devel
  # gcc hellogluster.c -lgfapi
  # ./a.out
  32. 32. "Hello, World!" with libgfapi
  #include <stdlib.h>
  #include <stdio.h>
  #include <string.h>
  #include <fcntl.h>
  #include <glusterfs/api/glfs.h>

  int main(int argc, char **argv) {
      const char *gfserver = "gluster01";
      const char *gfvol = "testvol01";
      int ret;
      glfs_t *fs;        /* struct representing the volume (filesystem) "testvol01" */
      glfs_fd_t *fd;

      /* Connecting to the volume. */
      fs = glfs_new(gfvol);
      glfs_set_volfile_server(fs, "tcp", gfserver, 24007);
      ret = glfs_init(fs);
      if (ret) {
          printf("Failed to connect server/volume: %s/%s\n", gfserver, gfvol);
          exit(ret);
      }

      /* Opening a new file on the volume. */
      char *greet = "Hello, Gluster!\n";
      fd = glfs_creat(fs, "greeting.txt", O_RDWR, 0644);

      /* Write and close the file. */
      glfs_write(fd, greet, strlen(greet), 0);
      glfs_close(fd);
      return 0;
  }
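To check the result (assuming the same volume is also FUSE-mountable, here at an illustrative mount point):
  # gcc hellogluster.c -lgfapi && ./a.out
  # mount -t glusterfs gluster01:/testvol01 /mnt
  # cat /mnt/greeting.txt
  Hello, Gluster!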
  33. 33. Thank you
