



                                 GlusterFS
                                 Cluster File System



                    Z RESEARCH Inc.
                    Non-stop Storage




© 2007 Z RESEARCH
    GlusterFS Cluster File System


   GlusterFS is a Cluster File System that aggregates
   multiple storage bricks over InfiniBand RDMA into
   one large parallel network file system.


                    Brick + Brick + ... + Brick  =  N x Performance & Capacity

    Key Design Considerations
            Capacity Scaling
                    Scalable beyond petabytes
            I/O Throughput Scaling
                    Pluggable clustered I/O schedulers
                    Takes advantage of RDMA transport
            Reliability
                    Non-stop storage
                    No separate metadata server
            Ease of Manageability
                    Self-heal
                    NFS-like disk layout
            Elegance in Design
                    Stackable modules
                    Not tied to I/O profiles, hardware, or OS
    GlusterFS Design
   [Architecture diagram, summarized:]

   Client side
        Cluster of clients (supercomputer, data center): each node runs the
        GlusterFS client, which stacks a clustered volume manager and a
        clustered I/O scheduler.
        Storage gateways re-export the file system over NFS/Samba on TCP/IP
        (GigE) for compatibility with MS Windows and other Unices; each
        gateway itself runs a GLFS client.

   Server side
        GlusterFS clustered file system on the x86-64 platform: Storage
        Bricks 1..4, each running GLFSD and serving a GlusterFS volume.

   Interconnect
        InfiniBand RDMA (or) TCP/IP between clients, gateways, and bricks.
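
   Each brick runs GLFSD with a server volume spec that exports a local
   directory over the chosen transport. A minimal sketch of one brick's spec
   is shown below; the export directory and the auth rule are illustrative
   assumptions, not taken from this deck.

   volume brick
    type storage/posix
    option directory /data/export            # local directory backing this brick
   end-volume

   volume server
    type protocol/server
    option transport-type ib-verbs/server    # or tcp/server over GigE
    subvolumes brick
    option auth.ip.brick.allow *             # illustrative: allow all clients
   end-volume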


    Stackable Design

   [Translator stack diagram, summarized:]

   GlusterFS client stack (top to bottom):
        VFS  ->  Read Ahead  ->  I/O Cache  ->  Unify

   Clients and servers are connected over TCP/IP (GigE, 10GigE) or
   InfiniBand RDMA.

   GlusterFS server stack on each brick (top to bottom):
        Server  ->  POSIX  ->  Ext3

   Bricks 1..n each run this server stack and export their local storage.
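
   Translator stacking is expressed in the client volume spec: each volume
   section names its subvolumes, so the spec below builds exactly the stack
   shown above (protocol/client bricks -> unify -> io-cache -> read-ahead).
   This is a minimal sketch; host addresses, remote volume names, and the
   choice of two bricks are illustrative assumptions.

   volume brick1
    type protocol/client
    option transport-type ib-verbs/client    # or tcp/client
    option remote-host 10.0.0.1              # illustrative brick address
    option remote-subvolume brick
   end-volume

   volume brick2
    type protocol/client
    option transport-type ib-verbs/client
    option remote-host 10.0.0.2
    option remote-subvolume brick
   end-volume

   volume unify0
    type cluster/unify
    subvolumes brick1 brick2
    option scheduler rr                      # any scheduler from the I/O Scheduling slide
   end-volume

   volume iocache
    type performance/io-cache
    subvolumes unify0
   end-volume

   volume readahead                          # topmost translator, sits under VFS
    type performance/read-ahead
    subvolumes iocache
   end-volume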
    Volume Layout
                        BRICK1              BRICK2              BRICK3

                Unify Volume (files distributed across bricks)
                        work.ods            benchmark.pdf       mylogo.xcf
                        corporate.odp       test.ogg            driver.c
                                            test.m4a
                        driver.c            initcore.c          ether.c

                Mirror Volume (files replicated on every brick)
                        accounts-2007.db    accounts-2007.db    accounts-2007.db
                        backup.db.zip       backup.db.zip       backup.db.zip
                        accounts-2006.db    accounts-2006.db    accounts-2006.db

                Stripe Volume (each file striped across all bricks)
                        north-pole-map      north-pole-map      north-pole-map
                        dvd-1.iso           dvd-1.iso           dvd-1.iso
                        xen-image           xen-image           xen-image
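
   In volume-spec terms, unify, mirror, and stripe correspond to the
   cluster/unify, cluster/afr, and cluster/stripe translators. The sketch
   below assumes three protocol/client subvolumes named brick1..brick3 and
   an illustrative 1MB stripe size.

   volume mirror0
    type cluster/afr                 # AFR (automatic file replication) = mirror
    subvolumes brick1 brick2 brick3
   end-volume

   volume stripe0
    type cluster/stripe
    option block-size *:1MB          # stripe unit per file pattern (assumed value)
    subvolumes brick1 brick2 brick3
   end-volume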

    I/O Scheduling
       ➢ Adaptive least usage (ALU)
       ➢ NUFA (non-uniform file access)
       ➢ Random
       ➢ Custom
       ➢ Round robin
   For example, the unify volume spec below selects the ALU scheduler and
   tunes its usage thresholds:
   volume bricks
    type cluster/unify
    subvolumes ss1c ss2c ss3c ss4c
    option scheduler alu
    option alu.limits.min-free-disk 60GB
    option alu.limits.max-open-files 10000
    option alu.order disk-usage:read-usage:write-usage:open-files-usage:disk-speed-usage
    option alu.disk-usage.entry-threshold 2GB        # Units in KB, MB and GB are allowed
    option alu.disk-usage.exit-threshold 60MB        # Units in KB, MB and GB are allowed
    option alu.open-files-usage.entry-threshold 1024
    option alu.open-files-usage.exit-threshold 32
    option alu.stat-refresh.interval 10sec
   end-volume
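
   Switching schedulers is a one-line change in the same spec. A minimal
   sketch using the round-robin scheduler is shown below, assuming "rr" is
   the spec name for the round-robin scheduler listed above:

   volume bricks
    type cluster/unify
    subvolumes ss1c ss2c ss3c ss4c
    option scheduler rr        # rotate file creation across subvolumes
   end-volume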

      GlusterFS Benchmarks

                                      Benchmark Environment

        Method: Multiple 'dd' streams with varying block sizes are read and written
        from multiple clients simultaneously.

        GlusterFS Brick Configuration (16 bricks)
        Processor - Dual Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
        RAM - 8GB FB-DIMM
        Linux Kernel - 2.6.18-5+em64t+ofed111 (Debian)
        Disk - SATA-II 500GB
        HCA - Mellanox MHGS18-XT/S InfiniBand HCA

        Client Configuration (64 clients)
        RAM - 4GB DDR2 (533 MHz)
        Processor - Single Intel(R) Pentium(R) D CPU 3.40GHz
        Linux Kernel - 2.6.18-5+em64t+ofed111 (Debian)
        Disk - SATA-II 500GB
        HCA - Mellanox MHGS18-XT/S InfiniBand HCA

        Interconnect switch: Voltaire InfiniBand switch (14U)

        GlusterFS version 1.3.pre0-BENKI


    Aggregated Bandwidth
 Aggregated I/O benchmark on 16 bricks (servers) and 64 clients over the IB Verbs transport.



                    [Chart: aggregated read/write throughput; peak read ~13 GBps]




    Peak aggregated read throughput: ~13 GBps.
    Beyond a certain load, write performance plateaus because of the disk I/O bottleneck.
    Provisioning system memory larger than the peak working set ensures the best possible performance.
    The ib-verbs transport driver is about 30% faster than the ib-sdp transport driver.

    Scalability

          Performance improves as the number of bricks is increased




   Throughput increases correspondingly as the number of servers grows from 1 to 16
    Hardware Platform Example
    Storage Building Block: Intel EPSD McKay Creek
       Intel SE5000PSL (Star Lake) baseboard
           Uses 2 (or 1) dual-core LV Intel® Xeon® (Woodcrest) processors
           Uses 1 Intel SRCSAS144e (Boiler Bay) RAID card
           Two 2 ½” boot HDDs, or boot from DOM
       2U form factor, 12 hot-swap HDDs
           3 ½” SATA 3.0Gbps (7.2k or 10k RPM)
           SAS (10k or 15k RPM)
       SES-compliant enclosure management firmware
       External SAS JBOD expansion
       Add-in card options: InfiniBand, 10 GbE, Fibre Channel








                    http://www.gluster.org
                    http://www.zresearch.com



                        Thank You!


 Hitesh Chellani
 Tel: 510-354-6801
 hitesh@zresearch.com
© 2007 Z RESEARCH
