GlusterFS – What, Why, How ...
Vijay Bellur
GlusterFS architect
Red Hat
03/19/15
Agenda
● 4 Ws and an H
● Why GlusterFS?
● What is GlusterFS?
● How does GlusterFS work?
● Where is GlusterFS used?
● When does feature foo appear in GlusterFS?
● Q & A
Why GlusterFS?
Why GlusterFS?
● 2.5+ exabytes of data produced every day!
● 90% of the world's data was created in the last two years
● Data needs to be stored somewhere!
● Commoditization and democratization are the way to go
source: http://www-01.ibm.com/software/data/bigdata/what-is-big-data.html
What is GlusterFS?
What is GlusterFS?
● Scale-out distributed storage system.
● Aggregates storage exports over network interconnects to provide a unified namespace.
● Layered on disk file systems that support extended attributes.
● Provides file, object and block interfaces for data access.
How does GlusterFS work?
Typical GlusterFS Deployment
GlusterFS Architecture – Foundations
● Software only, runs on commodity hardware
● No external metadata servers
● Scale-out with Elasticity
● Extensible and modular
● Deployment agnostic
Concepts & Algorithms
GlusterFS concepts – Trusted Storage Pool
[Diagram: Node1 probes Node2; once the probe is accepted, Node1 and Node2 are peers in a trusted storage pool]
GlusterFS concepts – Trusted Storage Pool
[Diagram: probing Node3 adds it to the trusted storage pool (Node1, Node2, Node3); a detach removes it again]
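The probe and detach operations map directly onto the gluster CLI; a minimal sketch (hostnames are placeholders, and a running glusterd is assumed on each node):

```
# Form a trusted storage pool from node1
gluster peer probe node2
gluster peer probe node3

# Verify pool membership
gluster peer status

# Remove a node from the pool
gluster peer detach node3
```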

GlusterFS concepts - Bricks
● A brick is the combination of a node and an export directory – e.g. hostname:/dir
● Each brick inherits the limits of the underlying filesystem
● No limit on the number of bricks per node
● Data and metadata are stored on bricks
[Diagram: three storage nodes exporting 3 bricks (/export1–/export3), 5 bricks (/export1–/export5) and 3 bricks respectively]
GlusterFS concepts - Volumes
● Logical collection of bricks.
● Identified by an administrator provided name.
● A volume, or part of a volume, can be mounted on a client:
● mount -t glusterfs server1:/<volname> /my/mnt/point
GlusterFS concepts - Volumes
[Diagram: two volumes, "music" and "videos", each spanning bricks (/export/brick1, /export/brick2) across Node1, Node2 and Node3]
Volume Types
➢ Type of a volume is specified at the time of volume creation
➢ Volume type determines how and where data is placed
➢ The following volume types are supported in glusterfs:
1) Distribute
2) Stripe
3) Replicate
4) Distributed Replicate
5) Striped Replicate
6) Distributed Striped Replicate
7) Disperse
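The type is chosen on the gluster command line at creation time; a sketch (volume names, hostnames and brick paths are placeholders):

```
# Plain distribute: files are spread across the four bricks
gluster volume create distvol node1:/export/brick1 node2:/export/brick1 \
    node3:/export/brick1 node4:/export/brick1

# Disperse: 4 data bricks + 2 redundancy bricks (GlusterFS 3.6 onwards)
gluster volume create dispvol disperse 6 redundancy 2 \
    node{1..6}:/export/brick1

gluster volume start distvol
```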
Distributed Volume
➢ Distributes files across various bricks of the volume.
➢ Directories are present on all bricks of the volume.
➢ Removes the need for an external metadata server and provides O(1) lookup.
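The O(1) lookup comes from hashing: each directory divides a 32-bit hash space into ranges, one per brick, and a file lives on the brick whose range covers the hash of its name. A simplified sketch of the idea (GlusterFS itself uses a Davies-Meyer hash and per-directory layouts; md5 and the fixed brick list here are stand-ins):

```python
import hashlib

BRICKS = ["node1:/export/brick1", "node2:/export/brick1", "node3:/export/brick1"]

def hash32(name: str) -> int:
    """Map a file name to a 32-bit hash value."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], "big")

def brick_for(name: str) -> str:
    """Pick the brick whose hash range contains the file's hash.

    Each brick owns an equal slice of the 2**32 hash space, so placement
    is a single hash computation -- no metadata server is consulted.
    """
    slice_size = 2**32 // len(BRICKS)
    idx = min(hash32(name) // slice_size, len(BRICKS) - 1)
    return BRICKS[idx]
```

Any client can compute the same placement independently, which is what removes the central metadata server from the lookup path.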
How does a distributed volume work?
Replicated Volume
● Synchronous replication of all updates.
● Provides HA for data.
● Transaction driven for ensuring consistency.
● Changelogs maintained for reconciliation.
● Any number of replicas can be configured.
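Creating a replicated volume and inspecting replica health, as a sketch (hostnames are placeholders):

```
# Three-way synchronous replication across three nodes
gluster volume create repvol replica 3 \
    node1:/export/brick1 node2:/export/brick1 node3:/export/brick1
gluster volume start repvol

# List files pending reconciliation (self-heal), e.g. after an outage
gluster volume heal repvol info
```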
How does a replicated volume work?
Distributed Replicated Volume
Disperse Volume
● Introduced in GlusterFS 3.6
● Erasure coding / RAID 5 over the network
● “Disperses” data on to various bricks
● IDA (Information Dispersal Algorithm): Reed-Solomon codes
● Non-systematic erasure coding
● Encoding / decoding done on the client side
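The core idea of dispersal can be shown with the simplest erasure code: RAID-5-style XOR parity, which survives the loss of any one fragment. This is only an illustration of the principle; GlusterFS's disperse translator uses Reed-Solomon codes, which generalize to multiple redundancy fragments:

```python
def encode(data: bytes, k: int):
    """Split data into k equal fragments plus one XOR parity fragment."""
    frag_len = -(-len(data) // k)              # ceiling division
    data = data.ljust(k * frag_len, b"\0")     # pad to a multiple of k
    frags = [data[i * frag_len:(i + 1) * frag_len] for i in range(k)]
    parity = bytearray(frag_len)
    for frag in frags:
        for i, b in enumerate(frag):
            parity[i] ^= b
    return frags + [bytes(parity)]

def decode(frags, orig_len):
    """Rebuild the original data; at most one fragment may be None (lost)."""
    frag_len = len(next(f for f in frags if f is not None))
    if None in frags:
        lost = frags.index(None)
        rebuilt = bytearray(frag_len)
        for i, frag in enumerate(frags):
            if i != lost:
                for j, b in enumerate(frag):
                    rebuilt[j] ^= b            # XOR of survivors = lost frag
        frags = frags[:lost] + [bytes(rebuilt)] + frags[lost + 1:]
    return b"".join(frags[:-1])[:orig_len]     # drop parity, strip padding
```

Because encoding and decoding are pure computation over the fragments, they can run on the client side, exactly as the slide notes.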
Access Mechanisms
Access Mechanisms
Gluster volumes can be accessed via the following mechanisms:
– FUSE based Native protocol
– NFSv3
– SMB
– libgfapi
– ReST/HTTP
– HDFS
– iSCSI
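Each access path has its own mount syntax; a sketch with placeholder server, volume and share names (the SMB share name depends on how the volume is exported through Samba):

```
# FUSE based native protocol
mount -t glusterfs server1:/myvol /mnt/gluster

# NFSv3 via the built-in Gluster NFS server
mount -t nfs -o vers=3 server1:/myvol /mnt/nfs

# SMB, with the volume exported as a Samba share
mount -t cifs //server1/gluster-myvol /mnt/smb -o user=admin
```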
FUSE based native access
libgfapi access
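With libgfapi an application talks to the volume in-process, with no FUSE mount in the data path. A sketch assuming the libgfapi-python bindings are installed (server and volume names are placeholders):

```python
from gluster import gfapi  # libgfapi-python bindings

# Connect to the volume directly -- no FUSE mount needed
vol = gfapi.Volume("server1", "myvol")
vol.mount()

# The interface mirrors familiar system calls
vol.mkdir("logs", 0o755)
with vol.fopen("logs/app.log", "w") as f:
    f.write("hello from libgfapi\n")

vol.umount()
```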
NFSv3 access with Gluster NFS
NFS-Ganesha with GlusterFS
SMB with GlusterFS
Object/ReST - SwiftOnFile
[Diagram: a client's HTTP request (Swift ReST API) flows through the proxy, account, container and object servers onto a Gluster volume, while the same files remain accessible to clients through an NFS or GlusterFS mount]
● Unified file and object view.
● Entity mapping between file and object building blocks: account ↔ volume, container ↔ directory, object ↔ file.
HDFS access
Features
● Scale-out NAS
● Elasticity
● Directory & Volume quotas
● Data Protection and Recovery
– Volume and File Snapshots
– User Serviceable Snapshots
– Geographic/Asynchronous replication
● Archival
– Read-only
– WORM
Features
● Performance
– Client side in-memory caching for performance
– Data, metadata and readdir caching
● Monitoring
– Built-in I/O statistics
– /proc like interface for introspection
● Provisioning
– puppet-gluster
– gluster-deploy
● More..
GlusterFS & oVirt
GlusterFS Monitoring with Nagios
http://www.ovirt.org/Features/Nagios_Integration
How is it implemented?
Translators in GlusterFS
● Building blocks for a GlusterFS process.
● Based on translators in GNU Hurd.
● Each translator is a functional unit.
● Translators can be stacked together to achieve the desired functionality.
● Translators are deployment agnostic – can be loaded in either the client or server stacks.
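The stacking idea can be sketched as a chain of objects that each transform a file operation before passing it down. The names below are illustrative, not GlusterFS's actual xlator API:

```python
class Translator:
    """A functional unit that forwards each operation to the next layer."""
    def __init__(self, next_xl=None):
        self.next_xl = next_xl

    def write(self, path, data):
        return self.next_xl.write(path, data)

class Replicate(Translator):
    """Fan each write out to every child subvolume."""
    def __init__(self, children):
        self.children = children

    def write(self, path, data):
        return [child.write(path, data) for child in self.children]

class Posix(Translator):
    """Bottom of the stack: persist the data (here, just a dict)."""
    def __init__(self):
        self.store = {}

    def write(self, path, data):
        self.store[path] = data
        return "ok"

# Stack a replication layer on top of two "bricks"
brick_a, brick_b = Posix(), Posix()
stack = Replicate([brick_a, brick_b])
```

Because each layer only knows about the layer below it, the same units can be loaded client-side or server-side, which is what makes the stack deployment agnostic.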
Customizable Translator Stack
Develop with GlusterFS
● Write your own translator
– Possible in C and Python as of today
● Write your custom application using libgfapi
– Interfaces similar to system calls
– C, Python, Java and Go bindings available
● Talk to Gluster through ReST APIs
Where is GlusterFS used?
GlusterFS Use Cases
Source: 2014 GlusterFS user survey
Ecosystem Integration
● Currently used with various ecosystem projects
● Virtualization
– OpenStack
– oVirt
– Qemu
– CloudStack
● Big Data Analytics
– Hadoop
– Tachyon
● File Sync and Share
– ownCloud
OpenStack Juno + GlusterFS – Current Integration
[Diagram: GlusterFS storage servers, accessed over FUSE/libgfapi, serve blocks, images, files and objects to OpenStack – Nova nodes use GlusterFS as ephemeral storage for KVM, alongside Cinder data, Glance data, Manila data, and Swift data/objects behind the Swift API]
When and What in GlusterFS.next?
New Features in GlusterFS 3.7
● Data Tiering
● Bitrot detection
● Sharding
● Performance improvements
● Better protection against split brain
● NFSv4, 4.1 and pNFS access using NFS Ganesha
● Netgroups style configuration for NFS
Data Tiering in 3.7
● Policy based data movement across hot and cold tiers
● New translator for identifying candidates for promotion/demotion
● Enables better utilization of different classes of storage devices/SSDs
[Diagram: on the client, a Tier xlator sits above a hot DHT and a cold DHT, with the replication and other client xlators below; on each brick, a CTR xlator above the POSIX xlator maintains a heat data store that drives promotion to the hot tier and demotion to the cold tier]
Bitrot detection in 3.7
● Detection of silent data corruption
● Checksum associated with each file
● Periodic data scrubbing
● Detection also upon access
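The scrubbing idea in miniature: record a checksum per file, then periodically recompute and compare. This is a simplification – GlusterFS stores the signature as an extended attribute on the brick and scrubs in the background:

```python
import hashlib

checksums = {}  # stand-in for the per-file signature (an xattr on the brick)

def sign(path: str, data: bytes) -> None:
    """Record a checksum when the file is written and signed."""
    checksums[path] = hashlib.sha256(data).hexdigest()

def scrub(path: str, data: bytes) -> bool:
    """Recompute and compare; False means silent corruption was detected."""
    return hashlib.sha256(data).hexdigest() == checksums[path]
```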
Sharding in 3.7
● Solves fragmentation in Gluster volumes
● Chunks and places data in any node that has
space
● Suitable for large file workloads
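Sharding splits a large file into fixed-size chunks that can each land on whichever brick has space; mapping a byte offset to a shard is simple arithmetic (the shard size is configurable per volume – 4 MB below is only an example):

```python
SHARD_SIZE = 4 * 1024 * 1024  # example shard size; configurable per volume

def shard_for(offset: int):
    """Map a byte offset in the large file to (shard index, offset in shard)."""
    return offset // SHARD_SIZE, offset % SHARD_SIZE
```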
Netgroups and Exports for NFS in 3.7
● More advanced configuration for authentication, based on an /etc/exports like syntax
● Support for netgroups
● Patches written at Facebook
● Forward ported from 3.4 to 3.7
● Cleaned up and posted for review
NFS Ganesha support in 3.7
● Supports NFSv4, NFSv4.1 with Kerberos
● pNFS support for Gluster Volumes follows later
● Modifications to Gluster internals
● Upcall infrastructure
● Gluster CLI to manage NFS-Ganesha
● libgfapi improvements
● High-Availability based on Pacemaker and Corosync
Small-file performance enhancements in 3.7
● Multithreaded epoll (transport layer)
● In-memory caching of stat and extended attributes on the bricks
● Improvements for directory reads
● Tiering of data within a volume
Features beyond GlusterFS 3.7
● HyperConvergence with oVirt
● At rest compression
● De-duplication
● Multi-protocol support with NFS, FUSE and SMB
● Native ReST APIs for gluster management
● More integration with OpenStack, Containers
Hyperconverged oVirt – GlusterFS
● Server nodes are used both for virtualization and for serving replicated images from the GlusterFS bricks
● The boxes can be standardized (hardware and deployment) for easy addition and replacement
● Support for both scaling up (adding more disks) and scaling out (adding more hosts)
[Diagram: VMs and the storage engine run on the same hosts; a GlusterFS volume spans the bricks on all hosts]
GlusterFS 4.0
GlusterFS 4.0
● Not evolutionary anymore
● Intended for massive scalability and manageability improvements; removes known bottlenecks
● Make it easier for devops to provision, manage and monitor
● Enable larger deployments and new use cases
GlusterFS 4.0
● New Style Replication
● Improved Distributed Hashing Translator (DHT)
● Composite operations in the GlusterFS RPC protocol
● Support for multiple networks
● Coherent client-side caching
● Advanced data tiering
● ... and much more
GlusterFS next releases
● 3.7 – April 2015
● 3.7+ – April 2015 until 4.0
● 4.0 – 2016
Resources
Mailing lists:
gluster-users@gluster.org
gluster-devel@nongnu.org
IRC:
#gluster and #gluster-dev on freenode
Web:
http://www.gluster.org
Thank you!
