Gluster
Tutorial
Jeff Darcy, Red Hat
LISA 2016 (Boston)
Agenda
▸ Alternating info-dump and hands-on
▹ This is part of the info-dump ;)
▸ Gluster basics
▸ Initial setup
▸ Extra features
▸ Maintenance and trouble-shooting
Who Am I?
▸ One of three project-wide architects
▸ First Red Hat employee to be seriously
involved with Gluster (before
acquisition)
▸ Previously worked on NFS (v2..v4),
Lustre, PVFS2, others
▸ General distributed-storage blatherer
▹ http://pl.atyp.us / @Obdurodon
TEMPLATE CREDITS
Special thanks to all the people who made and released these
awesome resources for free:
▸ Presentation template by SlidesCarnival
▸ Photographs by Death to the Stock Photo (license)
Some Terminology
▸ A brick is simply a directory on a server
▸ We use translators to combine bricks
into more complex subvolumes
▹ For scale, replication, sharding, ...
▸ This forms a translator graph,
contained in a volfile
▸ Internal daemons (e.g. self heal) use the
same bricks arranged into slightly
different volfiles
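To make the terminology concrete, here is a rough hand-written sketch of the kind of translator graph a volfile describes: two protocol/client translators (one per brick) combined by a replication translator. Real volfiles generated by GlusterD contain many more translators and options; the server names and brick paths below are placeholders.

volume fubar-client-0
    type protocol/client
    option remote-host serverA
    option remote-subvolume /brick1
end-volume

volume fubar-client-1
    type protocol/client
    option remote-host serverB
    option remote-subvolume /brick2
end-volume

volume fubar-replicate-0
    type cluster/replicate
    subvolumes fubar-client-0 fubar-client-1
end-volume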
Hands On: Getting Started
1. Use the RHGS test drive
▹ http://bit.ly/glustertestdrive
2. Start a Fedora/CentOS VM
▹ Use yum/dnf to install gluster
▹ base, libs, server, fuse, client-xlators, cli
3. Docker Docker Docker
▹ https://github.com/gluster/gluster-containers
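For option 2, something like the following should work on Fedora/CentOS; the package names are assumed from the list above and may vary slightly by release (CentOS may also need the Storage SIG repo enabled):

dnf install glusterfs glusterfs-libs glusterfs-server \
    glusterfs-fuse glusterfs-client-xlators glusterfs-cli
systemctl start glusterd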
Brick / Translator Example
[Diagram: four servers, each contributing one brick: Server A /brick1, Server B /brick2, Server C /brick3, Server D /brick4]
Brick / Translator Example
[Diagram: Server A /brick1 and Server B /brick2 form Replica Set 1 (a subvolume); Server C /brick3 and Server D /brick4 form Replica Set 2 (also a subvolume)]
Brick / Translator Example
[Diagram: Replica Set 1 (bricks 1 and 2) and Replica Set 2 (bricks 3 and 4) are combined into the volume “fubar”]
Translator Patterns
[Diagram: two translator patterns. Fan-out or “cluster” translators (e.g. AFR, EC, DHT) combine multiple subvolumes, shown here as AFR over Server A /brick1 and Server B /brick2 (Replica Set 1). Pass-through translators (e.g. performance translators such as md-cache) stack on top of a single subvolume.]
Access Methods
[Diagram: access methods (FUSE, Samba, Ganesha, TCMU, GFAPI) alongside internal daemons (self heal, rebalance, quota, snapshot, bitrot)]
GlusterD
▸ Management daemon
▸ Maintains membership, detects server
failures
▸ Stages configuration changes
▸ Starts and monitors other daemons
Simple Configuration Example
serverA# gluster peer probe serverB
serverA# gluster volume create fubar \
    replica 2 \
    serverA:/brick1 serverB:/brick2
serverA# gluster volume start fubar
clientX# mount -t glusterfs serverA:fubar \
    /mnt/gluster_fubar
Hands On: Connect Servers
[root@vagrant-testVM glusterfs]# gluster peer probe 192.168.121.66
peer probe: success.
[root@vagrant-testVM glusterfs]# gluster peer status
Number of Peers: 1
Hostname: 192.168.121.66
Uuid: 95aee0b5-c816-445b-8dbc-f88da7e95660
State: Accepted peer request (Connected)
Hands On: Server Volume Setup
[root@vagrant-testVM glusterfs]# gluster volume create fubar \
    replica 2 testvm:/d/backends/fubar{0,1} force
volume create: fubar: success: please start the volume to
access data
[root@vagrant-testVM glusterfs]# gluster volume info fubar
... (see for yourself)
[root@vagrant-testVM glusterfs]# gluster volume status fubar
Volume fubar is not started
Hands On: Server Volume Setup
[root@vagrant-testVM glusterfs]# gluster volume start fubar
volume start: fubar: success
[root@vagrant-testVM glusterfs]# gluster volume status fubar
Status of volume: fubar
Gluster process TCP Port RDMA Port Online Pid
------------------------------------------------------------------------------
Brick testvm:/d/backends/fubar0 49152 0 Y 13104
Brick testvm:/d/backends/fubar1 49153 0 Y 13133
Self-heal Daemon on localhost N/A N/A Y 13163
Task Status of Volume fubar
------------------------------------------------------------------------------
There are no active volume tasks
Hands On: Client Volume Setup
[root@vagrant-testVM glusterfs]# mount -t glusterfs testvm:fubar \
    /mnt/glusterfs/0
[root@vagrant-testVM glusterfs]# df /mnt/glusterfs/0
Filesystem 1K-blocks Used Available Use% Mounted on
testvm:fubar 5232640 33280 5199360 1% /mnt/glusterfs/0
[root@vagrant-testVM glusterfs]# ls -a /mnt/glusterfs/0
. ..
[root@vagrant-testVM glusterfs]# ls -a /d/backends/fubar0
. .. .glusterfs
Hands On: It’s a Filesystem!
▸ Create some files
▸ Create directories, symlinks, ...
▸ Rename, delete, ...
▸ Test performance
▹ OK, not yet
Distribution and Rebalancing
[Diagram: hash space from 0 to 0xffffffff, split at 0x7fffffff into Server X’s range (left) and Server Y’s range (right), with files shown as dots]
● Each brick “claims” a range of hash values
○ Collection of claims is called a layout
● Files (dots) are hashed, placed on brick
claiming that range
● When bricks are added, claims are adjusted to
minimize data motion
Distribution and Rebalancing
[Diagram: with two bricks, Server X claims 0 to 0x80000000 and Server Y claims 0x80000000 to 0xffffffff; after adding Server Z, the layout becomes X: 0 to 0x55555555, Z: 0x55555555 to 0xaaaaaaaa, Y: 0xaaaaaaaa to 0xffffffff, so only the files in the ranges marked “Move X->Z” and “Move Y->Z” migrate]
Sharding
▸ Divides files into chunks
▸ Each chunk is placed separately
according to hash
▸ High probability (not certainty) of
chunks being on different subvolumes
▸ Spreads capacity and I/O across
subvolumes
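Sharding is off by default; a rough sketch of enabling it on a volume (option names as in stock GlusterFS, block size chosen arbitrarily):

gluster volume set fubar features.shard on
gluster volume set fubar features.shard-block-size 64MB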
Hands On: Adding a Brick
[root@vagrant-testVM glusterfs]# gluster volume create xyzzy \
    testvm:/d/backends/xyzzy{0,1}
[root@vagrant-testVM glusterfs]# getfattr -d -e hex \
    -m trusted.glusterfs.dht /d/backends/xyzzy{0,1}
# file: d/backends/xyzzy0
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
# file: d/backends/xyzzy1
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff
Hands On: Adding a Brick
[root@vagrant-testVM glusterfs]# gluster volume add-brick xyzzy \
    testvm:/d/backends/xyzzy2
volume add-brick: success
[root@vagrant-testVM glusterfs]# gluster volume rebalance xyzzy \
    fix-layout start
volume rebalance: xyzzy: success: Rebalance on xyzzy has been started
successfully. Use rebalance status command to check status of the
rebalance process.
ID: 88782248-7c12-4ba8-97f6-f5ce6815963
Hands On: Adding a Brick
[root@vagrant-testVM glusterfs]# getfattr -d -e hex \
    -m trusted.glusterfs.dht /d/backends/xyzzy{0,1,2}
# file: d/backends/xyzzy0
trusted.glusterfs.dht=0x00000001000000000000000055555554
# file: d/backends/xyzzy1
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
# file: d/backends/xyzzy2
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9
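Note that fix-layout only rewrites the layout ranges; existing files stay where they are. To actually migrate data onto the new brick you would run a full rebalance, roughly (output elided):

gluster volume rebalance xyzzy start
gluster volume rebalance xyzzy status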
Split Brain (problem definition)
▸ “Split brain” is when we don’t have
enough information to determine
correct recovery action
▸ Can be caused by node failure or
network partition
▸ Every distributed data store has to
prevent and/or deal with it
How Replication Works
▸ Client sends operation (e.g. write) to all
replicas directly
▸ Coordination: pre-op, post-op, locking
▹ enables recovery in case of failure
▸ Self-heal (repair) usually done by
internal daemon
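A quick way to poke at the self-heal machinery from the CLI, assuming the fubar volume from earlier (output will vary):

gluster volume heal fubar info    # entries pending heal, per brick
gluster volume heal fubar         # trigger an index heal manually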
Split Brain (how it happens)
[Diagram: a network partition separates Server A and Client X from Server B and Client Y; each client keeps writing to the replica it can still reach, so the copies diverge]
Split Brain (what it looks like)
[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/0
ls: cannot access /mnt/glusterfs/0/best-sf: Input/output error
best-sf
[root@vagrant-testVM glusterfs]# cat /mnt/glusterfs/0/best-sf
cat: /mnt/glusterfs/0/best-sf: Input/output error
[root@vagrant-testVM glusterfs]# cat /d/backends/fubar0/best-sf
star trek
[root@vagrant-testVM glusterfs]# cat /d/backends/fubar1/best-sf
star wars
What the...?
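The heal CLI can confirm that this really is split brain rather than some other I/O error, and (in newer releases) resolve it from the command line; a rough sketch:

# list entries AFR considers split-brained, per brick
gluster volume heal fubar info split-brain
# pick one brick's copy as the source for a given file
gluster volume heal fubar split-brain source-brick testvm:/d/backends/fubar1 /best-sf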
Split Brain (dealing with it)
▸ Primary mechanism: quorum
▹ server side, client side, or both
▹ arbiters
▸ Secondary: rule-based resolution
▹ e.g. largest, latest timestamp
▹ Thanks, Facebook!
▸ Last choice: manual repair
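These mechanisms map onto volume options; a rough sketch of typical settings (option names as in stock GlusterFS, values illustrative):

# client-side quorum: writes fail locally (EROFS) without a majority of replicas
gluster volume set fubar cluster.quorum-type auto
# server-side quorum: bricks on a minority partition are taken down
gluster volume set fubar cluster.server-quorum-type server
# rule-based resolution, e.g. prefer the copy with the latest mtime
gluster volume set fubar cluster.favorite-child-policy mtime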
Server Side Quorum
[Diagram: Bricks A, B, C with Clients X and Y; the brick on the minority side of the partition is forced down, so Client X’s writes succeed while Client Y has no servers]
Client Side Quorum
[Diagram: Bricks A, B, C with Clients X and Y; all bricks stay up, Client X’s writes succeed, and Client Y’s writes are rejected locally with EROFS because it cannot reach a quorum]
Erasure Coding
▸ Encode N input blocks into N+K output
blocks, so that original can be recovered
from any N.
▸ RAID is erasure coding with K=1 (RAID 5)
or K=2 (RAID 6)
▸ Our implementation mostly has the
same flow as replication
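In Gluster terms an erasure-coded volume is a “disperse” volume; a sketch with N=4 data plus K=2 redundancy bricks (hostnames and paths are placeholders):

gluster volume create ecvol disperse 6 redundancy 2 \
    server{1..6}:/bricks/ec0
gluster volume start ecvol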
BREAK
Quota
▸ Gluster supports directory-level quota
▸ For nested directories, lowest applicable
limit applies
▸ Soft and hard limits
▹ Exceeding soft limit gets logged
▹ Exceeding hard limit gets EDQUOT
Quota
▸ Problem: global vs. local limits
▹ quota is global (per volume)
▹ files are pseudo-randomly distributed
across bricks
▸ How do we enforce this?
▸ Quota daemon exists to handle this
coordination
Hands On: Quota
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy enable
volume quota : success
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy soft-timeout 0
volume quota : success
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy hard-timeout 0
volume quota : success
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy \
    limit-usage /john 100MB
volume quota : success
Hands On: Quota
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy list
Path Hard-limit Soft-limit
-----------------------------------------------------------------
/john 100.0MB 80%(80.0MB)
Used Available Soft-limit exceeded? Hard-limit exceeded?
--------------------------------------------------------------
0Bytes 100.0MB No No
Hands On: Quota
[root@vagrant-testVM glusterfs]# dd if=/dev/zero \
    of=/mnt/glusterfs/0/john/bigfile bs=1048576 count=85 conv=sync
85+0 records in
85+0 records out
89128960 bytes (89 MB) copied, 1.83037 s, 48.7 MB/s
[root@vagrant-testVM glusterfs]# grep -i john /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/d-backends-xyzzy0.log:[2016-11-29 14:31:44.581934]
A [MSGID: 120004] [quota.c:4973:quota_log_usage] 0-xyzzy-quota: Usage
crossed soft limit: 80.0MB used by /john
Hands On: Quota
[root@vagrant-testVM glusterfs]# dd if=/dev/zero \
    of=/mnt/glusterfs/0/john/bigfile2 bs=1048576 count=85 conv=sync
dd: error writing '/mnt/glusterfs/0/john/bigfile2': Disk quota exceeded
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy list | cut -c 66-
Used Available Soft-limit exceeded? Hard-limit exceeded?
--------------------------------------------------------------
101.9MB 0Bytes Yes Yes
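To back the experiment out, the limit can be removed or quota disabled for the volume entirely, along these lines:

gluster volume quota xyzzy remove /john
gluster volume quota xyzzy disable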
Snapshots
▸ Gluster supports read-only snapshots
and writable clones of snapshots
▸ Also, snapshot restores
▸ Support is based on / tied to LVM thin
provisioning
▹ originally supposed to be more
platform-agnostic
▹ maybe some day it really will be
Hands On: Snapshots
[root@vagrant-testVM glusterfs]# fallocate -l $((100*1024*1024)) \
    /tmp/snap-brick0
[root@vagrant-testVM glusterfs]# losetup --show -f /tmp/snap-brick0
/dev/loop3
[root@vagrant-testVM glusterfs]# vgcreate snap-vg0 /dev/loop3
Volume group "snap-vg0" successfully created
Hands On: Snapshots
[root@vagrant-testVM glusterfs]# lvcreate -L 50MB -T /dev/snap-vg0/thinpool
Rounding up size to full physical extent 52.00 MiB
Logical volume "thinpool" created.
[root@vagrant-testVM glusterfs]# lvcreate -V 200MB -T /dev/snap-vg0/thinpool \
    -n snap-lv0
Logical volume "snap-lv0" created.
[root@vagrant-testVM glusterfs]# mkfs.xfs /dev/snap-vg0/snap-lv0
...
[root@vagrant-testVM glusterfs]# mount /dev/snap-vg0/snap-lv0 \
    /d/backends/xyzzy0
...
Hands On: Snapshots
[root@vagrant-testVM glusterfs]# gluster volume create xyzzy \
    testvm:/d/backends/xyzzy{0,1} force
[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/0/file1
[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/0/file2
[root@vagrant-testVM glusterfs]# gluster snapshot create snap1 xyzzy
snapshot create: success: Snap snap1_GMT-2016.11.29-14.57.11 created
successfully
[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/0/file3
Hands On: Snapshots
[root@vagrant-testVM glusterfs]# gluster snapshot activate \
    snap1_GMT-2016.11.29-14.57.11
Snapshot activate: snap1_GMT-2016.11.29-14.57.11: Snap activated
successfully
[root@vagrant-testVM glusterfs]# mount -t glusterfs \
    testvm:/snaps/snap1_GMT-2016.11.29-14.57.11/xyzzy /mnt/glusterfs/1
[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/1
file1 file2
[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/1/file3
-bash: /mnt/glusterfs/1/file3: Read-only file system
Hands On: Snapshots
[root@vagrant-testVM glusterfs]# gluster snapshot clone clone1 \
    snap1_GMT-2016.11.29-14.57.11
snapshot clone: success: Clone clone1 created successfully
[root@vagrant-testVM glusterfs]# gluster volume start clone1
volume start: clone1: success
[root@vagrant-testVM glusterfs]# mount -t glusterfs testvm:/clone1 \
    /mnt/glusterfs/2
[root@vagrant-testVM glusterfs]# echo goodbye > /mnt/glusterfs/2/file3
Hands On: Snapshots
# Unmount and stop clone.
# Stop original volume - but leave snapshot activated!
[root@vagrant-testVM glusterfs]# gluster snapshot restore snap1_GMT-2016.11.29-14.57.11
Restore operation will replace the original volume with the snapshotted volume. Do you still want to
continue? (y/n) y
Snapshot restore: snap1_GMT-2016.11.29-14.57.11: Snap restored successfully
[root@vagrant-testVM glusterfs]# gluster volume start xyzzy
volume start: xyzzy: success
[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/0
file1 file2
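For cleanup afterwards, snapshots and clones are managed with commands along these lines (names are placeholders; the restore above consumes snap1, so only the clone is left):

gluster snapshot list
gluster snapshot delete <snapname>
gluster volume stop clone1
gluster volume delete clone1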
BREAK
Other Features
▸ Geo-replication
▸ Bitrot detection
▸ Transport security
▸ Encryption, compression/dedup etc. can
be done locally on bricks
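A few rough one-liners showing roughly where these features live in the CLI (slave host/volume names are placeholders; TLS additionally requires certificates to be set up on all nodes):

# geo-replication: asynchronous replication to a remote slave volume
gluster volume geo-replication fubar slavehost::slavevol create push-pem
gluster volume geo-replication fubar slavehost::slavevol start
# bitrot detection: background checksumming and scrubbing of brick data
gluster volume bitrot fubar enable
# transport security: TLS on the brick I/O path
gluster volume set fubar client.ssl on
gluster volume set fubar server.ssl on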
Gluster 4.x
▸ GlusterD 2
▹ higher scale + interfaces + smarts
▸ Server-side replication
▸ DHT improvements for scale
▸ More multitenancy
▹ subvolume mounts, throttling/QoS
Thank You!
http://gluster.org
jdarcy@redhat.com
