  1. Network filesystems in heterogeneous cloud applications
     Supervisor: Massimo Masera (Università di Torino, INFN)
     Company Tutor: Stefano Bagnasco (INFN, TO)
     Tutor: Dario Berzano (INFN, TO)
     Candidate: Matteo Concas
  2. Computing @ LHC: how is the GRID structured?
     [Diagram: the experiments at CERN (ATLAS, CMS, ALICE, LHCb) produce ~15 PB/year of raw data at Tier-0; data flow at ~1 GB/s to the Tier-1 centres (FZK Karlsruhe, CNAF Bologna, IN2P3 Lyon, ...), and from there to the Tier-2 centres (Catania, Torino, Bari, Legnaro, ...).]
     Data are distributed over a federated network called the Grid, which is hierarchically organized in Tiers.
  3. Computing infrastructure @ INFN Torino (V.M. = virtual machine)
     ● Grid node (legacy Tier-2): batch processes; submitted jobs are queued and executed as soon as there are enough free resources; output is stored on Grid storage asynchronously (job submission, data retrieval).
     ● ALICE PROOF facility: interactive processes; all resources are allocated at the same time; job splitting is dynamic and results are returned immediately to the client (continuous 2-way communication).
     ● Generic virtual farms (new-generation cloud storage): VMs can be added dynamically and removed as needed; the end user logs in remotely and does not need to know how his/her farm is physically structured.
  4. Distributing and federating the storage
  5. Introduction: Distributed storage
     ● Aggregation of several storages:
       ○ Several nodes and disks seen as one pool in the same LAN (Local Area Network)
       ○ Many pools aggregated geographically through WAN (Wide Area Network) → cloud storage
       ○ Concurrent access by many clients is optimized → "closest" replica
     [Diagram: clients of Site 1 and Site 2 access their local LAN pools; the two sites are geo-replicated over the WAN.]
     Network filesystems are the backbone of these infrastructures
  6. Why distributing the storage?
     ● Local disk pools:
       ○ several disks: no single hard drive can be big enough → aggregate disks
       ○ several nodes: some number crunching, and network, required to look up and serve data → distribute the load
       ○ client scalability → serve many clients
       ○ on local pools, filesystem operations (r, w, mkdir, etc.) are synchronous
     ● Federated storage (scale is geographical):
       ○ a single site cannot contain all data
       ○ moving job processing close to its data, not vice versa → distributed data ⇔ distributed computing
       ○ filesystem operations are asynchronous
  7. Distributed storage solutions
     ● Every distributed storage has:
       ○ a backend which aggregates disks
       ○ a frontend which serves data over a network
     ● Many solutions:
       ○ Lustre, GPFS, GFS → popular in the Grid world
       ○ stackable, e.g.: aggregate with Lustre, serve with NFS
     ● NFS is not a distributed storage → it does not aggregate, it only serves over the network
  8. Levels of aggregation in Torino
     ● Hardware aggregation (RAID) of hard drives → virtual block devices (LUN: logical unit number)
     ● Software aggregation of block devices → each LUN is aggregated using Oracle Lustre:
       ○ a separate server keeps the "file information" (MDS: metadata server)
       ○ one or more servers are attached to the block devices (OSS: object storage servers)
       ○ quasi-vertical scalability → the "master" server (i.e., the MDS) is a bottleneck; more can be added (hard and critical work!)
     ● Global federation → the local filesystem is exposed through xrootd:
       ○ Torino's storage is part of a global federation
       ○ used by the ALICE experiment @ CERN
       ○ a global, external "file catalog" knows whether a file is in Torino or not
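     As an illustration of this stacking (a minimal sketch, not the exact Torino configuration: hostnames, filesystem names and paths below are placeholders), a Lustre filesystem is mounted on the worker nodes and files are then reachable through the xrootd federation:
     > mount -t lustre mds01@tcp0:/torinofs /storage                        # Lustre client mount (MDS + OSSes behind it)
     > xrdcp root://alice-door.example.org//alice/data/file.root /tmp/      # read a file through the xrootd federation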
  9. What is GlusterFS
     ● Open source, distributed network filesystem claiming to scale up to several petabytes and to handle many clients
     ● Horizontal scalability → workload distributed across "bricks"
     ● Reliability:
       ○ elastic management → maintenance operations are online
       ○ bricks can be added, removed or replaced without stopping the service
       ○ rebalance → when a new "brick" is added, it is filled so as to ensure an even distribution of data
       ○ self-healing on "replicated" volumes → a form of automatic failback & failover
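     These elastic-management features map onto the gluster command line; a minimal sketch, assuming a placeholder volume "vol01" and placeholder brick paths:
     > gluster volume add-brick vol01 hypervisor05:/bricks/b1    # grow the volume while it stays online
     > gluster volume rebalance vol01 start                      # spread existing files onto the new brick
     > gluster volume heal vol01 info                            # inspect self-healing status (replicated volumes)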
  10. GlusterFS structure
     ● GlusterFS servers cross-communicate with no central manager → horizontal scalability
     [Diagram: four hypervisors, each exposing a GlusterFS brick; the bricks are connected peer-to-peer and form a single GlusterFS volume.]
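     A minimal sketch of how such a trusted pool of peers and a volume on top of it are assembled (hostnames, volume name and brick paths are placeholders):
     > gluster peer probe hypervisor02         # run on hypervisor01: add the other servers to the trusted pool
     > gluster peer probe hypervisor03
     > gluster peer probe hypervisor04
     > gluster volume create vol01 hypervisor01:/bricks/b1 hypervisor02:/bricks/b1 hypervisor03:/bricks/b1 hypervisor04:/bricks/b1
     > gluster volume start vol01
     > gluster volume info vol01               # every peer reports the same volume: no central manager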
  11. Internship activities
  12. Preliminary studies
     ● Verify compatibility of GlusterFS precompiled packages (RPMs) on CentOS 5 and 6 for the production environment
     ● Packages not available for development versions: new functionalities tested from source code (e.g. object storage)
     ● Tests on virtual machines (first on local VirtualBox, then on the INFN Torino OpenNebula cloud) http://opennebula.org/
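     A hedged sketch of the package check on CentOS (package names as commonly distributed for GlusterFS 3.3; repository setup omitted):
     > yum install glusterfs glusterfs-server glusterfs-fuse    # server daemon + FUSE client
     > service glusterd start                                   # start the management daemon
     > glusterfs --version                                      # confirm the installed release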
  13. Types of benchmarks
     ● Generic stress benchmarks conducted on:
       ○ the super-distributed prototype
       ○ pre-existing production volumes
     ● Specific stress benchmarks conducted on some types of GlusterFS volumes (e.g. the replicated volume)
     ● Application-specific tests:
       ○ high-energy physics analysis running on ROOT PROOF
  14. Note
     ● Tests were conducted in two different circumstances:
       a. a storage built for the sole purpose of testing: for the benchmarks, this volume is less performant than the infrastructure ones
       b. production volumes, which were certainly subject to interference from concurrent processes
     "Why perform these tests?"
  15. Motivations
     ● Verify consistency of the "release notes" → test all the different volume types:
       ○ replicated
       ○ striped
       ○ distributed
     ● Test GlusterFS in a realistic environment → build a prototype as similar as possible to the production infrastructure
  16. Experimental setup
     ● GlusterFS v3.3 turned out to be stable after tests conducted both on VirtualBox and on OpenNebula VMs
     ● Next step: build an experimental "super-distributed" prototype, a realistic testbed environment consisting of:
       ○ 40 HDDs [500 GB each] → ~20 TB (1 TB ≃ 10^12 B)
       ○ GlusterFS installed on every hypervisor
       ○ each hypervisor mounted 2 HDDs → 1 TB each
       ○ all the hypervisors were connected to each other (LAN)
     ● Software used for benchmarks: bonnie++
       ○ very simple to use read/write benchmark for disks
       ○ http://www.coker.com.au/bonnie++/
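     A sketch of how one hypervisor's two disks could be prepared as bricks and how a client then sees the aggregated volume (device names, paths and the volume name are placeholders):
     > mkfs.xfs /dev/sdb && mkfs.xfs /dev/sdc                    # format the two 500 GB disks
     > mkdir -p /bricks/b1 /bricks/b2
     > mount /dev/sdb /bricks/b1 && mount /dev/sdc /bricks/b2    # each hypervisor contributes ~1 TB of bricks
     > mount -t glusterfs hypervisor01:/testvol /mnt/testvol     # client side: mount the aggregated volume
     > bonnie++ -d /mnt/testvol -s 5000 -r 2500 -f               # one bonnie++ pass against the volume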
  17. Striped volume
     ● used in high-concurrency environments accessing large files (in our case ~10 GB);
     ● useful to store large data sets, if they have to be accessed from multiple instances. (source: www.gluster.org)
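     A sketch of how a striped volume like the one benchmarked here could be created (GlusterFS 3.3 syntax; stripe count, volume and brick names are placeholders):
     > gluster volume create striped-vol stripe 4 transport tcp hv01:/bricks/b1 hv02:/bricks/b1 hv03:/bricks/b1 hv04:/bricks/b1
     > gluster volume start striped-vol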
  18. Striped volume / results
     striped volume:
       Sequential Write per Blocks:  38.6 MB/s (std. deviation 1.3 MB/s)
       Sequential Rewrite:           23.0 MB/s (std. deviation 3.6 MB/s)
       Sequential Read per Blocks:   44.7 MB/s (std. deviation 1.3 MB/s)
  19. Striped volume / comments
     ● Benchmark setup: the software used is bonnie++ v1.96; each test is repeated 10 times; the size of the written files [MB] is at least double the machine RAM size, although GlusterFS doesn't have any sort of file cache
     > for i in {1..10}; do bonnie++ -d$SOMEPATH -s5000 -r2500 -f; done;
     ● It has the second best result in write (per blocks), and the most stable one (lowest std. deviation)
  20. Replicated volume:
     ● used where high availability and high reliability are critical
     ● main task → create forms of redundancy: data availability matters more than high I/O performance
     ● requires a great use of resources, both disk space and CPU (especially during the self-healing procedure) (source: www.gluster.org)
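     A sketch of creating a two-way replicated volume (volume and brick names are placeholders; the number of bricks must be a multiple of the replica count):
     > gluster volume create repl-vol replica 2 transport tcp hv01:/bricks/b1 hv02:/bricks/b1
     > gluster volume start repl-vol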
  21. Replicated volume:
     ● Self-healing feature: given "N" redundant servers, if at most (N-1) crash → services keep running on the volume ⇝ once the servers are restored, they get synchronized with the one(s) that didn't crash
     ● The self-healing feature was tested by turning off servers (even abruptly!) during I/O processes
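     A sketch of how the healing state can be inspected after a crashed server comes back (volume name is a placeholder; these commands are available from GlusterFS 3.3):
     > gluster volume heal repl-vol info    # list entries that still need to be synchronized
     > gluster volume heal repl-vol full    # force a full resynchronization if needed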
  22. Replicated / results
     replicated volume:
       Sequential Write per Blocks:  35.5 MB/s (std. deviation 2.5 MB/s)
       Sequential Rewrite:           19.1 MB/s (std. deviation 16.1 MB/s)
       Sequential Read per Blocks:   52.2 MB/s (std. deviation 7.1 MB/s)
  23. Replicated / comments
     ● Low rates in write and the best result in read → writes need to be synchronized, while read throughput benefits from multiple sources
     ● very important for building stable volumes on critical nodes
     ● the "self-healing" feature worked fine: it uses all available cores during the resynchronization process, and it does so online (i.e. with no service interruption, only slowdowns!)
  24. Distributed volume:
     ● Files are spread across the bricks in a fashion that ensures uniform distribution
     ● Use a purely distributed volume only if redundancy is not required or lies elsewhere (e.g. RAID)
     ● With no redundancy, a disk/server failure can result in loss of data, but only some bricks are affected, not the whole volume! (source: www.gluster.org)
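     A sketch of a purely distributed volume: with no stripe or replica keyword, GlusterFS simply distributes whole files across the bricks (volume and brick names are placeholders):
     > gluster volume create dist-vol transport tcp hv01:/bricks/b1 hv02:/bricks/b1 hv03:/bricks/b1
     > gluster volume start dist-vol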
  25. Distributed / results
     distributed volume:
       Sequential Write per Blocks:  39.8 MB/s (std. deviation 5.4 MB/s)
       Sequential Rewrite:           22.3 MB/s (std. deviation 2.8 MB/s)
       Sequential Read per Blocks:   52.1 MB/s (std. deviation 2.2 MB/s)
  26. Distributed / comments
     ● Best result in write and the second best result in read → high-performance volume
     ● Since the volume is not striped, and no high client concurrency was used, we don't exploit the full potential of GlusterFS → done in subsequent tests
     Some other tests were also conducted on different mixed volume types (e.g. striped+replicated)
  27. Overall comparison
  28. Production volumes
     ● Tests conducted on two volumes used at the INFN Torino computing centre: the VM images repository and the disk where running VMs are hosted
     ● Tests executed without interrupting production services → results are expected to be slightly influenced by concurrent computing activities (even if they were not network-intensive)
  29. Production volumes: Imagerepo
     [Diagram: the Images Repository holds the VM images (virtual-machine-img-1 ... virtual-machine-img-n); all the hypervisors (Hypervisor 1 ... Hypervisor m) mount it over the network.]
  30. Production volumes: Vmdir
     [Diagram: several service hypervisors direct their I/O streams onto a shared GlusterFS volume that hosts the running VM disks.]
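     A sketch of how a hypervisor could mount these production volumes through the GlusterFS FUSE client (server names, volume names and mount points are placeholders):
     > mount -t glusterfs storage01:/imagerepo /var/lib/imagerepo   # shared repository of VM images
     > mount -t glusterfs storage01:/vmdir /var/lib/vmdir           # volume hosting the running VM disks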
  31. Production volumes / Results
  32. Production volumes / Results (2)
     Image Repository:
       Sequential Write per Blocks:  64.4 MB/s (std. deviation 3.3 MB/s)
       Sequential Rewrite:           38.0 MB/s (std. deviation 0.4 MB/s)
       Sequential Read per Blocks:   98.3 MB/s (std. deviation 2.3 MB/s)
     Running VMs volume:
       Sequential Write per Blocks:  47.6 MB/s (std. deviation 2.2 MB/s)
       Sequential Rewrite:           24.8 MB/s (std. deviation 1.5 MB/s)
       Sequential Read per Blocks:   62.7 MB/s (std. deviation 0.8 MB/s)
     ● Imagerepo is a distributed volume (GlusterFS → 1 brick)
     ● The running VMs volume is a replicated volume → worse performance, but the single point of failure is eliminated by replicating both disks and servers
     ● Both volumes are more performant than the testbed ones → better underlying hardware resources
  33. PROOF test
     ● PROOF: ROOT-based framework for interactive (non-batch, unlike the Grid) physics analysis, used by ALICE and ATLAS, officially part of the computing model
     ● Simulate a real use case → not artificial, with a storage made of 3 LUNs (over a RAID5) of 17 TB each, in distributed mode
     ● many concurrent accesses: GlusterFS scalability is extensively exploited
  34. PROOF test / Results
     Concurrent processes → throughput [MB/s]:
        60 → 473
        66 → 511
        72 → 535
        78 → 573
        84 → 598
        96 → 562
       108 → 560
     ● Optimal range of concurrent accesses: 84-96
     ● Plateau beyond the optimal range
  35. Conclusions and possible developments
     ● GlusterFS v3.3.1 was considered stable and satisfied all the prerequisites required of a network filesystem → the upgrade was performed and is currently in use!
     ● Run some more tests (e.g. in different use cases)
     ● Look out for the next developments in GlusterFS v3.4.x → likely improvements and integration with QEMU/KVM http://www.gluster.org/2012/11/integration-with-kvmqemu
  36. Thanks for your attention
     Thanks to:
     ● Prof. Massimo Masera
     ● Stefano Bagnasco
     ● Dario Berzano
  37. Backup slides
  38. GlusterFS actors (source: www.gluster.org)
  39. Conclusions: overall comparison
  40. Striped + Replicated volume:
     ● it stripes data across replicated bricks in the cluster;
     ● one should use striped replicated volumes in highly concurrent environments where there is parallel access to very large files and performance is critical.
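     A sketch of a striped replicated volume (GlusterFS 3.3 syntax; counts, volume and brick names are placeholders; the number of bricks must be a multiple of stripe x replica):
     > gluster volume create sr-vol stripe 2 replica 2 transport tcp hv01:/bricks/b1 hv02:/bricks/b1 hv03:/bricks/b1 hv04:/bricks/b1
     > gluster volume start sr-vol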
  41. Striped + replicated / results
     striped+replicated volume:
       Sequential Output per Blocks:  31.0 MB/s (std. deviation 0.3 MB/s)
       Sequential Output Rewrite:     18.4 MB/s (std. deviation 4.7 MB/s)
       Sequential Input per Blocks:   44.5 MB/s (std. deviation 1.6 MB/s)
  42. Striped + replicated / comments
     ● Tests on these volumes always covered one I/O process at a time, so it is quite normal that a volume type designed for highly concurrent environments appears less performant.
     ● It still keeps decent I/O ratings.
  43. Imagerepo / results
     imagerepo volume:
       Sequential Output per Blocks:  98.3 MB/s (std. deviation 3.3 MB/s)
       Sequential Output Rewrite:     38.0 MB/s (std. deviation 0.4 MB/s)
       Sequential Input per Blocks:   64.4 MB/s (std. deviation 2.3 MB/s)
  44. Imagerepo / comments
     ● The input and output (per block) tests gave a high value compared with the previous tests, due to the greater availability of resources.
     ● Imagerepo is the repository where the images of virtual machines are stored, ready to be cloned and turned on in vmdir.
     ● It is very important that this repository is always up in order to avoid data loss, so creating a replicated repository is recommended.
  45. Vmdir / results
     vmdir volume:
       Sequential Output per Blocks:  47.6 MB/s (std. deviation 2.2 MB/s)
       Sequential Output Rewrite:     24.8 MB/s (std. deviation 1.5 MB/s)
       Sequential Input per Blocks:   62.7 MB/s (std. deviation 0.8 MB/s)
  46. vmdir / comments
     ● These results are worse than imagerepo's, but still better than the first three (test-volume).
     ● It is a volume shared by two servers to 5 machines hosting the virtual machine instances, so it is very important that this volume doesn't crash.
     ● It is the best candidate to become a replicated+striped+distributed volume.
  47. [Diagram: four hypervisors, each exposing a GlusterFS brick; the bricks are connected peer-to-peer and form a single GlusterFS volume.]
  48. from: Gluster_File_System-3.3.0-Administration_Guide-en-US (see more at: www.gluster.org/community/documentation)