Network filesystems in
heterogeneous cloud applications
Supervisor: Massimo Masera (Università di Torino, INFN)
Company Tutor: Stefano Bagnasco (INFN, TO)
Tutor: Dario Berzano (INFN, TO)



Candidate: Matteo Concas
Computing @LHC: how is the GRID structured?

[Diagram: the four LHC experiments (ATLAS, CMS, ALICE, LHCb) produce ~15 PB/year of raw data at Tier-0 (CERN); data flow (~1 GB/s) to Tier-1 centres such as FZK (Karlsruhe), CNAF (Bologna) and IN2P3 (Lyon), which in turn serve Tier-2 sites such as Catania, Torino, Bari and Legnaro.]

Data are distributed over a federated network called the Grid, which is hierarchically organized in Tiers.
Computing infrastructure @INFN Torino
*V.M. = virtual machine

[Diagram: a pool of V.M.s backed by the legacy Tier-2 data storage and a new-generation cloud storage, serving three use cases.]

● Grid node. Batch processes: submitted jobs are queued and executed as soon as there are enough free resources; output is stored on Grid storage asynchronously (job submission / data retrieval).
● ALICE PROOF facility. Interactive processes: all resources are allocated at the same time; job splitting is dynamic and results are returned immediately to the client (continuous two-way communication).
● Generic virtual farms. VMs can be added dynamically and removed as needed (remote login); the end user doesn't know how his/her farm is physically structured.
Distributing and federating the storage
Introduction: Distributed storage
● Aggregation of several storages:
  ○ several nodes and disks seen as one pool in the same LAN (Local Area Network)
  ○ many pools aggregated geographically through a WAN (Wide Area Network) → cloud storage
  ○ concurrent access by many clients is optimized → "closest" replica

[Diagram: Site 1 and Site 2, each aggregating storage over its LAN, are geo-replicated over the WAN and accessed by Clients 1 … m.]

Network filesystems are the backbone of these infrastructures
Why distribute the storage?
● Local disk pools:
  ○ several disks: no single hard drive can be big enough → aggregate disks
  ○ several nodes: some number crunching, and network capacity, are required to look up and serve data → distribute the load
  ○ client scalability → serve many clients
  ○ on local pools, filesystem operations (read, write, mkdir, etc.) are synchronous

● Federated storage (scale is geographical):
  ○ a single site cannot contain all the data
  ○ move job processing close to the data, not vice versa → distributed data ⇔ distributed computing
  ○ filesystem operations are asynchronous
Distributed storage solutions
● Every distributed storage has:
  ○ a backend which aggregates disks
  ○ a frontend which serves data over a network

● Many solutions:
  ○ Lustre, GPFS, GFS → popular in the Grid world
  ○ stackable, e.g. aggregate with Lustre, serve with NFS (a minimal sketch follows)

● NFS is not a distributed storage → it does not aggregate, it only serves data over the network
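As an illustration of the "stackable" approach mentioned above, here is a minimal sketch of aggregating with Lustre and serving with NFS; the hostnames, filesystem name and export path are hypothetical, and the exact options depend on the installation.

# Gateway node: mount the Lustre filesystem (aggregation layer).
# "mds01" is a hypothetical MGS/MDS host, "lustrefs" the filesystem name.
mount -t lustre mds01@tcp0:/lustrefs /mnt/lustrefs

# Serve the mounted tree to the LAN over plain NFS (serving layer).
echo "/mnt/lustrefs 10.0.0.0/24(rw,sync,no_root_squash)" >> /etc/exports
exportfs -ra

# Any client mounts it as ordinary NFS, unaware of the Lustre backend.
mount -t nfs gateway01:/mnt/lustrefs /data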
Levels of aggregation in Torino
● Hardware aggregation (RAID) of hard drives → virtual block devices (LUN: logical unit number)

● Software aggregation of block devices → the LUNs are aggregated using Oracle Lustre:
  ○ a separate server keeps the "file information" (MDS: metadata server)
  ○ one or more servers are attached to the block devices (OSS: object storage servers)
  ○ quasi-vertical scalability → the "master" server (i.e., the MDS) is a bottleneck; more can be added (hard & critical work!)

● Global federation → the local filesystem is exposed through xrootd:
  ○ Torino's storage is part of a global federation
  ○ used by the ALICE experiment @ CERN
  ○ a global, external "file catalog" knows whether a file is in Torino or not
What is GlusterFS
● Open source, distributed network filesystem claiming to scale up to several petabytes and to handle many clients

● Horizontal scalability → workload distributed through "bricks"

● Reliability:
  ○ elastic management → maintenance operations are performed online (see the sketch below)
  ○ bricks can be added, removed or replaced without stopping the service
  ○ rebalance → when a new "brick" is added, data are redistributed to ensure an even spread
  ○ self-healing on "replicated" volumes → a form of automatic failback & failover
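A hedged sketch of what this elastic management looks like from the GlusterFS command line; the volume name, server names and brick paths are made up for illustration, and the syntax is the 3.3-era CLI:

# Grow the volume with a new brick, then rebalance data onto it, all online.
gluster volume add-brick testvol hyp05:/bricks/b1
gluster volume rebalance testvol start
gluster volume rebalance testvol status

# Replace a failing brick without stopping the volume.
gluster volume replace-brick testvol hyp02:/bricks/b1 hyp06:/bricks/b1 start

# Shrink the volume: data is migrated away before the brick is removed.
gluster volume remove-brick testvol hyp03:/bricks/b1 start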
GlusterFS structure
● GlusterFS servers cross-communicate with no central manager → horizontal scalability

[Diagram: four hypervisors, each running a GlusterFS server that exposes a brick; the bricks are joined over p2p connections into a single GlusterFS volume. A minimal setup sketch follows.]
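To make the picture concrete, a minimal setup sketch for a volume like the one in the diagram; hostnames and brick paths are hypothetical:

# Join the hypervisors into one trusted pool (run from any one of them).
gluster peer probe hyp02
gluster peer probe hyp03
gluster peer probe hyp04

# Create a volume from one brick per hypervisor and start it.
gluster volume create testvol transport tcp \
    hyp01:/bricks/b1 hyp02:/bricks/b1 hyp03:/bricks/b1 hyp04:/bricks/b1
gluster volume start testvol

# Any client can mount the whole volume through any of the servers.
mount -t glusterfs hyp01:/testvol /mnt/testvol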
Internship activities
Preliminary studies
● Verify compatibility of GlusterFS precompiled packages (RPMs) on CentOS 5 and 6 for the production environment

● Packages not available for development versions: new functionalities tested from source code (e.g. object storage)

● Tests on virtual machines (first on a local VirtualBox, then on the INFN Torino OpenNebula cloud)    http://opennebula.org/
Types of benchmarks
● Generic stress benchmarks conducted on:
  ○ the super-distributed prototype
  ○ pre-existing production volumes

● Specific stress benchmarks conducted on some types of GlusterFS volumes (e.g. replicated volumes)

● Application-specific tests:
  ○ high-energy physics analysis running on ROOT/PROOF
Note
● Tests were conducted in two different circumstances:

  a. storage built for the sole purpose of testing: for the benchmarks, this volume performs worse than the infrastructure ones

  b. production volumes were certainly subject to interference from concurrent processes

             "Why perform these tests?"
Motivations
● Verify the consistency of the "release notes" → test all the different volume types:
  ○ replicated
  ○ striped
  ○ distributed

● Test GlusterFS in a realistic environment → build a prototype as similar as possible to the production infrastructure
Experimental setup
● GlusterFS v3.3 turned out to be stable after tests conducted both on VirtualBox and on OpenNebula VMs
● Next step: build an experimental "super distributed" prototype, a realistic testbed environment consisting of:
  ○ 40 HDDs [500 GB each] → ~20 TB (1 TB ≃ 10^12 B)
  ○ GlusterFS installed on every hypervisor
  ○ each hypervisor mounted 2 HDDs → 1 TB per hypervisor
  ○ all the hypervisors connected to each other over the LAN
● Software used for benchmarks: bonnie++
  ○ a very simple to use read/write benchmark for disks
  ○ http://www.coker.com.au/bonnie++/
Striped volume
● used in high-concurrency environments accessing large files (in our case ~10 GB);
● useful to store large data sets, if they have to be accessed from multiple instances (a creation sketch follows).

(diagram source: www.gluster.org)
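For reference, creating a striped volume with the GlusterFS CLI would look roughly as follows; the four bricks and their hosts are hypothetical, and file contents are striped across all of them:

gluster volume create stripedvol stripe 4 transport tcp \
    hyp01:/bricks/b1 hyp02:/bricks/b1 hyp03:/bricks/b1 hyp04:/bricks/b1
gluster volume start stripedvol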
Striped volume / results

            Sequential Write          Sequential              Sequential Read
            per Blocks [MB/s]         Rewrite [MB/s]          per Blocks [MB/s]
            Average    Std. Dev.      Average    Std. Dev.    Average    Std. Dev.

striped       38.6        1.3           23.0        3.6         44.7        1.3
Striped volume / comments
● Software used is bonnie++ v1.96; each test is repeated 10 times
● -s sets the size of the written files [MB] (at least double the RAM size); -r declares the machine RAM size, although GlusterFS doesn't have any sort of file cache

> for i in {1..10}; do bonnie++ -d$SOMEPATH -s5000 -r2500 -f; done;

● Has the second best result in write (per blocks), and the most stable one (lowest std. deviation)
Replicated volume:
● used where high availability and high reliability are critical
● main task → create forms of redundancy: data availability is more important than high I/O performance
● requires substantial resources, both disk space and CPU (especially during the self-healing procedure); a creation sketch follows

(diagram source: www.gluster.org)
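A sketch of how such a volume is created and how the self-healing state can be inspected after an outage; hostnames and the volume name are hypothetical, and the heal subcommand is the 3.3-era one:

# Two-way replication: every file exists on both bricks.
gluster volume create replvol replica 2 transport tcp \
    hyp01:/bricks/b1 hyp02:/bricks/b1
gluster volume start replvol

# After a server comes back, list the files still needing resynchronization
# and trigger the self-heal explicitly if desired.
gluster volume heal replvol info
gluster volume heal replvol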
Replicated volume:
● Self-healing feature: given N redundant servers, if at most N-1 of them crash, services keep running on the volume; once the crashed servers are restored, they get synchronized with the one(s) that didn't crash

● The self-healing feature was tested by turning off servers (even abruptly!) during I/O processes
Replicated / results

               Sequential Write          Sequential              Sequential Read
               per Blocks [MB/s]         Rewrite [MB/s]          per Blocks [MB/s]
               Average    Std. Dev.      Average    Std. Dev.    Average    Std. Dev.

replicated       35.5        2.5           19.1       16.1         52.2        7.1
Replicated / comments
● Low rates in write and the best result in read → writes need to be synchronized, while read throughput benefits from multiple sources

● Very important for building stable volumes on critical nodes

● The "self healing" feature worked fine: it uses all available cores during the resynchronization process, and it does so online (i.e. with no service interruption, only slowdowns!)
Distributed volume:
● Files are spread across the bricks in a fashion that ensures uniform distribution
● A purely distributed volume makes sense only if redundancy is not required or lies elsewhere (e.g. RAID); see the sketch below
● With no redundancy, a disk/server failure can result in loss of data, but only some bricks are affected, not the whole volume!

(diagram source: www.gluster.org)
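For comparison with the other types, a purely distributed volume is what the CLI creates when neither "stripe" nor "replica" is given (hypothetical bricks again); each file lands whole on exactly one brick:

gluster volume create distvol transport tcp \
    hyp01:/bricks/b1 hyp02:/bricks/b1 hyp03:/bricks/b1
gluster volume start distvol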
Distributed / results

                Sequential Write          Sequential              Sequential Read
                per Blocks [MB/s]         Rewrite [MB/s]          per Blocks [MB/s]
                Average    Std. Dev.      Average    Std. Dev.    Average    Std. Dev.

distributed       39.8        5.4           22.3        2.8         52.1        2.2
Distributed / comments
● Best result in write and the second best in read → a high-performance volume

● Since the volume is not striped, and no high client concurrency was used, we don't exploit the full potential of GlusterFS → done in subsequent tests

Some other tests were also conducted on different mixed volume types (e.g. striped+replicated)
Overall comparison
Production volumes
● Tests conducted on two volumes used at the INFN Torino computing center: the VM images repository and the disk where running VMs are hosted

● Tests executed without interrupting production services → results are expected to be slightly influenced by concurrent computing activities (even if they were not network-intensive)
Production volumes: Imagerepo

[Diagram: an "Images Repository" volume holding virtual-machine-img-1 … virtual-machine-img-n, mounted over the network by Hypervisor 1 … Hypervisor m; a mount sketch follows.]
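On each hypervisor the repository is simply a GlusterFS mount; a sketch, with hypothetical server, volume and mount-point names:

# One-off mount of the shared image repository.
mount -t glusterfs storage01:/imagerepo /var/lib/imagerepo

# Or persistently, via an /etc/fstab entry:
# storage01:/imagerepo  /var/lib/imagerepo  glusterfs  defaults,_netdev  0 0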
Production volumes: Vmdir

[Diagram: several service hypervisors, each with an I/O stream to and from the shared GlusterFS volume hosting the running VMs.]
Production volumes / Results
Production volumes / Results (2)

                      Sequential Write          Sequential              Sequential Read
                      per Blocks [MB/s]         Rewrite [MB/s]          per Blocks [MB/s]
                      Average    Std. Dev.      Average    Std. Dev.    Average    Std. Dev.

Image Repository        64.4        3.3           38.0        0.4         98.3        2.3
Running VMs volume      47.6        2.2           24.8        1.5         62.7        0.8

● Imagerepo is a distributed volume (GlusterFS → 1 brick)
● The running VMs volume is a replicated volume → worse performance, but the single point of failure is eliminated by replicating both disks and servers
● Both volumes are more performant than the testbed ones → better underlying hardware resources
PROOF test
● PROOF: ROOT-based framework for interactive (non-batch, unlike the Grid) physics analysis, used by ALICE and ATLAS and officially part of their computing model

● Simulate a real use case → not artificial, with a storage made of 3 LUNs (over RAID5) of 17 TB each in distributed mode

● Many concurrent accesses: GlusterFS scalability is extensively exploited
PROOF test / Results

    Concurrent Processes     MB/s
             60               473
             66               511
             72               535
             78               573
             84               598
             96               562
            108               560

● Optimal range of concurrent accesses: 84-96
● Plateau beyond the optimal range
Conclusions and possible developments
● GlusterFS v3.3.1 was considered stable and satisfied all the prerequisites required of a network filesystem
  → the upgrade was performed and is currently in use!
● Run some more tests (e.g. in different use cases)
● Follow the next developments in GlusterFS v3.4.x → likely improvements and integration with QEMU/KVM (a sketch of what that enables follows)

                 http://www.gluster.org/2012/11/integration-with-kvmqemu
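Based on the gluster:// block driver described in the linked post (hostnames, volume and image names are hypothetical), a VM disk could then be accessed through libgfapi instead of a FUSE mount:

# Create a VM disk image directly on the GlusterFS volume...
qemu-img create -f qcow2 gluster://storage01/vmdir/test-vm.qcow2 20G

# ...and boot a guest from it, bypassing the FUSE layer.
qemu-system-x86_64 -m 2048 \
    -drive file=gluster://storage01/vmdir/test-vm.qcow2,if=virtio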
Thanks for your attention

Thanks to:
● Prof. Massimo Masera
● Stefano Bagnasco
● Dario Berzano
Backup slides
GlusterFS actors

[Diagram: the GlusterFS actors (source: www.gluster.org)]
Conclusions: overall comparison
Striped + Replicated volume:
● It stripes data across replicated bricks in the cluster;
● one should use striped replicated volumes in highly concurrent environments where there is parallel access to very large files and performance is critical (a creation sketch follows).
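A creation sketch, assuming the 3.3-era CLI and hypothetical bricks; stripe 2 × replica 2 requires four bricks:

gluster volume create strepvol stripe 2 replica 2 transport tcp \
    hyp01:/bricks/b1 hyp02:/bricks/b1 hyp03:/bricks/b1 hyp04:/bricks/b1
gluster volume start strepvol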
Striped + replicated / results

                     Sequential Output         Sequential Output       Sequential Input
                     per Blocks [MB/s]         Rewrite [MB/s]          per Blocks [MB/s]
                     Average    Std. Dev.      Average    Std. Dev.    Average    Std. Dev.

striped+replicated     31.0        0.3           18.4        4.7         44.5        1.6
Striped + replicated / comments
● Tests on these volumes always involved one I/O process at a time, so it is expected that a volume type designed for highly concurrent environments appears less performant.
● It still delivers decent I/O rates.
Imagerepo / results

              Sequential Output         Sequential Output       Sequential Input
              per Blocks [MB/s]         Rewrite [MB/s]          per Blocks [MB/s]
              Average    Std. Dev.      Average    Std. Dev.    Average    Std. Dev.

imagerepo       98.3        3.3           38.0        0.4         64.4        2.3
Imagerepo / comments
● The input and output (per block) tests gave higher values than the previous tests, due to the greater availability of resources.
● Imagerepo is the repository storing the virtual machine images that are ready to be cloned and started in vmdir.
● It is very important that this repository is always up and that data loss is avoided, so creating a replicated repository is recommended.
Vmdir / results

          Sequential Output         Sequential Output       Sequential Input
          per Blocks [MB/s]         Rewrite [MB/s]          per Blocks [MB/s]
          Average    Std. Dev.      Average    Std. Dev.    Average    Std. Dev.

vmdir       47.6        2.2           24.8        1.5         62.7        0.8
vmdir / comments
● These results are worse than imagerepo's, but still better than the first three (test-volume).
● It is a volume shared by two servers to 5 machines hosting the virtual machine instances, so it is very important that this volume doesn't crash.
● It is the best candidate to become a replicated+striped+distributed volume.
[Diagram: GlusterFS structure, as in the earlier slide: four hypervisors, each with a GlusterFS brick, joined over p2p connections into one GlusterFS volume.]

From: Gluster_File_System-3.3.0-Administration_Guide-en-US
(see more at: www.gluster.org/community/documentation)
