Slide 1
Gluster Technical Overview
Kaleb KEITHLEY
Red Hat
11 April, 2015
Slide 2
Concepts:
● Translator — a shared library that implements storage semantics
  ● e.g. distribution, replication, performance, etc.
● Translator Stack — a directed graph of translators
● Brick — a server and a directory
● Volume — a collection of bricks
Slide 3
Gluster Processes
● Server: glusterfsd — the heart of it all
● Clients:
  ● FUSE, glusterfs
  ● NFS server, glusterfs (gluster client, NFS server)
● Management: glusterd
● Healing: glustershd
● And more…
Slide 4
Let's get real — creating a volume
srv1% gluster peer probe srv2
...
srv1% gluster volume create my_vol srv1:/bricks/my_vol
srv1% gluster volume start my_vol
...
client1% mount -t glusterfs srv1:my_vol /mnt
Slide 5
Simple Translator Stack — Linear
Client side (top to bottom): debug/io-stats → performance/stat-prefetch → performance/quick-read → performance/io-cache → performance/read-ahead → performance/write-behind → protocol/client
Server side (top to bottom): protocol/server → debug/io-stats → features/marker → performance/io-threads → features/locks → features/access-control → storage/posix
Slide 6
[Diagram: a single-brick volume. On the client, an application writes to /mnt; the I/O passes through the kernel VFS and FUSE into the glusterfs client process. On the server, glusterfsd receives it and writes through the kernel VFS to the brick; glusterd manages the server.]
Slide 7
Let's get real — creating a distribute volume
srv1% gluster volume create my_dist srv1:/bricks/my_vol srv2:/bricks/my_vol
srv1% gluster volume start my_dist
...
client1% mount -t glusterfs srv1:my_dist /mnt
Slide 8
Complex Translator Stack — Distribute (DHT)
Client side (top to bottom): debug/io-stats → performance/stat-prefetch → ... → performance/write-behind → cluster/distribute → two protocol/client instances (one per brick)
Each server (top to bottom): protocol/server → debug/io-stats → ... → features/access-control → storage/posix
Slide 9
[Diagram: the same client stack, now with two servers. The cluster/distribute xlator in the client's glusterfs process splits the I/O between server 1 and server 2; each server runs glusterd and glusterfsd over its own brick.]
Slide 10
Complex Translator Stack — DHT + NFS
NFS side (top to bottom): nfs/server → debug/io-stats → performance/stat-prefetch → ... → performance/write-behind → cluster/distribute → two protocol/client instances (one per brick); this stack serves the NFS clients and runs on one of the servers (next slide)
Each server (top to bottom): protocol/server → debug/io-stats → ... → features/access-control → storage/posix
Slide 11
[Diagram: two servers, each running glusterd and glusterfsd over its brick. A glusterfs process containing the NFS-server "client" stack runs on the servers and serves the NFS clients.]
Slide 12
Let's get real — creating a replica volume
srv1% gluster volume create my_repl replica 2 srv1:/bricks/my_vol srv2:/bricks/my_vol
srv1% gluster volume start my_repl
...
client1% mount -t glusterfs srv1:my_repl /mnt
Slide 13
Complex Translator Stack — Replica (AFR)
Client side (top to bottom): debug/io-stats → performance/stat-prefetch → ... → performance/write-behind → cluster/replicate (AFR) → two protocol/client instances (one per replica brick)
Each server (top to bottom): protocol/server → debug/io-stats → ... → features/access-control → storage/posix
Slide 14
[Diagram: the same client stack, but replicating: the client's glusterfs process sends each write to both server 1 and server 2, each running glusterd and glusterfsd over its own brick.]
Slide 15
Tiering: a variation on distribution
● Hot and cold data
[Diagram: the client's glusterfs mount distributes between a hot tier and a cold tier; each tier is a server running glusterd and glusterfsd over its own brick.]
Slide 16
Adding a brick — distributed and rebalancing
[Diagram: before — the client mount distributes across server 1 and server 2, each with glusterd and glusterfsd over a brick.]
Slide 17
[Diagram: after — server 3 and its brick have been added; the client mount now distributes across three servers while the volume rebalances.]
Slide 18
Adding a brick — replica and healing
[Diagram: before — the client mount replicates across server 1 and server 2, each with glusterd and glusterfsd over a brick.]
Slide 19
[Diagram: after — server 3 and its brick have been added as a new replica, and healing brings it up to date.]
Slide 20
Isn't FUSE slow?
● Count the copies
● And context switches
[Diagram: the FUSE data path — application → kernel VFS/FUSE → glusterfs client → glusterfsd — with the data crossing the user/kernel boundary and being copied along the way.]
Slide 21
gfapi to the rescue
● POSIX-like API
  ● glfs_open()
  ● glfs_close()
  ● glfs_read(), glfs_write()
  ● glfs_lseek()
  ● glfs_stat()
  ● etc.
[Diagram: with gfapi the application talks to glusterfsd directly from user space, skipping the kernel VFS/FUSE hops.]
Slide 22
libgfapi — hellogluster.c
#include <fcntl.h>
#include <glusterfs/api/glfs.h>

int main (int argc, char **argv)
{
        const char *filename = (argc > 1) ? argv[1] : "hello";
        glfs_t     *fs;
        glfs_fd_t  *fd;
        int         ret;

        fs = glfs_new ("fsync");    /* "fsync" is the volume name */
        ret = glfs_set_volfile_server (fs, "tcp", "localhost", 24007);
        /* or ret = glfs_set_volfile (fs, "/tmp/foo.vol"); */
        ret = glfs_init (fs);
        fd = glfs_creat (fs, filename, O_RDWR, 0644);
        ret = glfs_write (fd, "hello gluster", 14, 0);
        glfs_close (fd);
        glfs_fini (fs);             /* build with: gcc hellogluster.c -lgfapi */
        return 0;
}
Slide 23
Writing a translator in Python with GluPy
[Diagram: a GluPy translator sits in the stack between io-cache and dht. Fops are STACK_WINDed down into it and cbks STACK_UNWINDed back up; the Python side (gluster.py) is bridged to libglusterfs through ctypes (ffi).]
Slide 24
SwiftOnFile — RESTful API, examples with curl
● Create a container:
  curl -v -X PUT -H "X-Auth-Token: $authtoken" https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername -k
● List containers:
  curl -v -X GET -H "X-Auth-Token: $authtoken" https://$myhostname:443/v1/AUTH_$myvolname -k
● Copy a file into a container (upload):
  curl -v -X PUT -T $filename -H "X-Auth-Token: $authtoken" -H "Content-Length: $filelen" https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername/$filename -k
● Copy a file from a container (download):
  curl -v -X GET -H "X-Auth-Token: $authtoken" https://$myhostname:443/v1/AUTH_$myvolname/$mycontainername/$filename -k > $filename
Slide 25
Translator basics
● Translators are shared objects (shlibs)
● Methods
  ● int32_t init (xlator_t *this);
  ● void fini (xlator_t *this);
● Data
  ● struct xlator_fops fops = { ... };
  ● struct xlator_cbks cbks = { };
  ● struct volume_options options[] = { ... };
● Client, Server, Client/Server
● Threads: write MT-SAFE
● Portability: GlusterFS != Linux only
● License: GPLv2 or LGPLv3+
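To make those pieces concrete, here is a minimal sketch of a do-nothing pass-through translator. It assumes the headers and build setup of the GlusterFS source tree, omits details such as memory-accounting setup, and the log message is illustrative:

#include "xlator.h"     /* from the glusterfs source tree */
#include "logging.h"

int32_t
init (xlator_t *this)
{
        /* a translator normally sanity-checks its children here */
        if (!this->children || this->children->next) {
                gf_log (this->name, GF_LOG_ERROR,
                        "this translator needs exactly one child");
                return -1;
        }
        return 0;
}

void
fini (xlator_t *this)
{
        /* nothing to tear down in this sketch */
        return;
}

/* no fops are overridden, so every fop falls back to the built-in
   pass-through defaults supplied by libglusterfs */
struct xlator_fops fops = { };

struct xlator_cbks cbks = { };

struct volume_options options[] = {
        { .key = {NULL} },
};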
Slide 26
Every method has a different signature
● Open fop method and callback
typedef int32_t (*fop_open_t) (call_frame_t *, xlator_t *, loc_t *, int32_t, fd_t *, dict_t *);
typedef int32_t (*fop_open_cbk_t) (call_frame_t *, void *, xlator_t *, int32_t, int32_t, fd_t *, dict_t *);
● Rename fop method and callback
typedef int32_t (*fop_rename_t) (call_frame_t *, xlator_t *, loc_t *, loc_t *, dict_t *);
typedef int32_t (*fop_rename_cbk_t) (call_frame_t *, void *, xlator_t *, int32_t, int32_t, struct iatt *, struct iatt *, struct iatt *, struct iatt *, struct iatt *, dict_t *);
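As a quick illustration of how a translator implements one of these pairs, here is a minimal pass-through open fop and callback matching the typedefs above; the noop_ names are hypothetical:

int32_t
noop_open_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
               int32_t op_ret, int32_t op_errno, fd_t *fd, dict_t *xdata)
{
        /* hand the child's result straight back to our parent */
        STACK_UNWIND_STRICT (open, frame, op_ret, op_errno, fd, xdata);
        return 0;
}

int32_t
noop_open (call_frame_t *frame, xlator_t *this, loc_t *loc, int32_t flags,
           fd_t *fd, dict_t *xdata)
{
        /* pass the open straight down to our one child */
        STACK_WIND (frame, noop_open_cbk, FIRST_CHILD (this),
                    FIRST_CHILD (this)->fops->open, loc, flags, fd, xdata);
        return 0;
}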
Slide 27
Data Types in Translators
● call_frame_t — per-I/O call context as the fop traverses the stack
● xlator_t — translator context
● inode_t — represents a file on disk; ref-counted
● fd_t — represents an open file; ref-counted
● iatt_t — ~= struct stat
● dict_t — ~= Python dict (or C++ std::map)
● client_t — represents the connected client
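Because inode_t and fd_t are ref-counted, a translator that needs one to outlive the current fop takes its own reference. A small fragment sketching the pattern (the surrounding fop is elided and the variable name is illustrative):

        inode_t *kept = inode_ref (loc->inode);   /* take our own reference */
        /* ... use kept after the fop has unwound, e.g. from a later callback ... */
        inode_unref (kept);                       /* drop it when we are done */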
Slide 28
fop methods and fop callbacks
uidmap_writev (...)
{
        ...
        STACK_WIND (frame, uidmap_writev_cbk,
                    FIRST_CHILD (this),
                    FIRST_CHILD (this)->fops->writev,
                    fd, vector, count, offset, iobref);
        /* DANGER ZONE */
        return 0;
}
● Effectively lose control after STACK_WIND
● Callback might have already happened
● Or might be running right now
● Or maybe it's not going to run 'til later
Slide 29
fop methods and fop callback methods, cont.
uidmap_writev_cbk (call_frame_t *frame, void *cookie, ...)
{
        ...
        STACK_UNWIND_STRICT (writev, frame,
                             op_ret, op_errno, prebuf, postbuf);
        return 0;
}
● The I/O is complete when the callback is called
Slide 30
STACK_WIND versus STACK_WIND_COOKIE
● Pass extra data to the cbk with STACK_WIND_COOKIE
quota_statfs (call_frame_t *frame, xlator_t *this, loc_t *loc)
{
        inode_t *root_inode = loc->inode->table->root;
        STACK_WIND_COOKIE (frame, quota_statfs_cbk,
                           root_inode, FIRST_CHILD (this),
                           FIRST_CHILD (this)->fops->statfs, loc, xdata);
        return 0;
}
● There is also frame->local
  ● shared by all STACK_WIND callbacks (sketched below)
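A sketch of the frame->local alternative, reusing the earlier uidmap_writev example; uidmap_local_t and its field are hypothetical, and a real translator would allocate the local from GlusterFS's own pools and free it in the callback before unwinding:

typedef struct {
        uid_t orig_uid;
} uidmap_local_t;

int32_t
uidmap_writev (call_frame_t *frame, xlator_t *this, fd_t *fd,
               struct iovec *vector, int32_t count, off_t offset,
               struct iobref *iobref)
{
        uidmap_local_t *local = calloc (1, sizeof (*local));

        local->orig_uid = frame->root->uid;   /* remember the caller's uid */
        frame->local = local;   /* every cbk wound from this frame can see it */

        STACK_WIND (frame, uidmap_writev_cbk, FIRST_CHILD (this),
                    FIRST_CHILD (this)->fops->writev,
                    fd, vector, count, offset, iobref);
        return 0;
}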
Slide 31
STACK_WIND, STACK_WIND_COOKIE, cont.
● Pass extra data to the cbk with STACK_WIND_COOKIE
quota_statfs_cbk (call_frame_t *frame, void *cookie, ...)
{
        inode_t *root_inode = cookie;
        ...
}
Slide 32
STACK_UNWIND versus STACK_UNWIND_STRICT
● STACK_UNWIND_STRICT uses the correct type
/* return from function in a type-safe way */
#define STACK_UNWIND (frame, params ...)
do {
        ret_fn_t fn = frame->ret;
        ...
versus
#define STACK_UNWIND_STRICT (op, frame, params ...)
do {
        fop_##op##_cbk_t fn = (fop_##op##_cbk_t) frame->ret;
        ...
● And why wouldn't you want strong typing?
Slide 33
Calling multiple children (fan out)
afr_writev_wind (...)
{
        ...
        for (i = 0; i < priv->child_count; i++) {
                if (local->transaction.pre_op[i]) {
                        STACK_WIND_COOKIE (frame, afr_writev_wind_cbk,
                                           (void *) (long) i,
                                           priv->children[i],
                                           priv->children[i]->fops->writev,
                                           local->fd, ...);
                }
        }
        return 0;
}
Slide 34
Calling multiple children, cont. (fan in)
afr_writev_wind_cbk (...)
{
        LOCK (&frame->lock);
        callcnt = --local->call_count;
        UNLOCK (&frame->lock);
        if (callcnt == 0) /* we're done */
                ...
}
● Failure by any one child can mean the whole transaction failed
● And it needs to be handled accordingly (see the sketch below)
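One common way to handle that, sketched here with the callback's signature written out; the local fields follow the snippet above and the aggregation is simplified (real AFR does much more):

int32_t
afr_writev_wind_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                     int32_t op_ret, int32_t op_errno,
                     struct iatt *prebuf, struct iatt *postbuf)
{
        afr_local_t *local = frame->local;
        int          callcnt = 0;

        LOCK (&frame->lock);
        {
                if (op_ret == -1) {
                        local->op_ret = -1;      /* at least one child failed */
                        local->op_errno = op_errno;
                }
                callcnt = --local->call_count;
        }
        UNLOCK (&frame->lock);

        if (callcnt == 0) {
                /* the last child has answered: report the aggregate result */
                STACK_UNWIND_STRICT (writev, frame,
                                     local->op_ret, local->op_errno,
                                     prebuf, postbuf);
        }
        return 0;
}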
Slide 35
Dealing With Errors: I/O errors
uidmap_writev_cbk (call_frame_t *frame, void *cookie,
                   xlator_t *this, int32_t op_ret, int32_t op_errno, ...)
{
        ...
        STACK_UNWIND_STRICT (writev, frame, -1, EIO, ...);
        return 0;
}
● op_ret: 0 or -1, success or failure
● op_errno: from <errno.h>
● Use an op_errno that's valid and/or relevant for the fop
Slide 36
Dealing With Errors: FOP method errors
uidmap_writev (call_frame_t *frame, xlator_t *this, ...)
{
        ...
        if (horrible_logic_error_must_abort) {
                goto error; /* glusterfs idiom */
        }
        STACK_WIND (frame, uidmap_writev_cbk, ...);
        return 0;
error:
        STACK_UNWIND_STRICT (writev, frame, -1, EIO, NULL, NULL);
        return 0;
}
Slide 37
Call to action
● Go forth and write applications for GlusterFS!
Slide 38

Editor's Notes

  • #6 Here's a simple single-brick volume. You can see which translators run on the client and which ones run on the server. In the lower left is the brick, the real disk drive where the data is stored. In the upper left is the "virtual" disk on the client. A write to the virtual disk on the client traverses all the translators down to the protocol/client translator, where both the data and metadata are marshaled and sent to the server, then down through the server's stack to the disk. The results come back along the opposite path.
  • #9 Here's a similar setup, but now we have added another brick and are using DHT. See how the I/O is split by the cluster/distribute xlator and sent to both servers.
  • #11 A similar setup, but now we've added NFS by adding an NFS xlator. Color here is important. On the previous slides I used different colors to indicate which pieces ran on different machines. That's still true here, but notice that the NFS-server "client" stack runs on one of the servers. Thus you can see how GlusterFS uses NFS with DHT (or AFR).
  • #14 Here's a similar setup, but now we have added another brick and are replicating with AFR. See how the I/O is sent by the cluster/replicate xlator to both servers.
  • #23 And here's how to handle some unrecoverable error, logic or otherwise, in the fop. Note the glusterfs idiom (retch).
  • #25 Here's an example of community at work. Gluster never provided a -devel rpm until we did it for Fedora. The HekaFS source tree is a good starting point for adding your own translators when building out of tree, or build in tree: on modern hardware the whole gluster build takes only a couple of minutes. Need an example of the files to change to add a new xlator.
  • #26 fops: a function pointer table or vtable with its own "sub-API". cbks: never used in the translator, legacy; the run-time will dlsym it, so you have to have it. Think carefully about where your translator should run: client? server? either one? GlusterFS is threaded, so your xlator must be MT-SAFE. This is LinuxCon, yes, and GlusterFS is developed on Linux, but… Say some words about the pieces and their license(s).
  • #27 Here we compare the fop and cbk signatures for open(), and also for rename(), just to emphasize the point that fops and cbks have different signatures. Perhaps that's obvious…
  • #28 call_frame_t: context for the I/O as it traverses down and back up the stack of translators; valid for the duration of this particular I/O. xlator_t: context about the particular translator the I/O is in at this moment in time; valid for as long as the translator is loaded, i.e. effectively indefinitely. inode_t and fd_t: it is possible for them to go away; if you need them to stay around, bump their ref count, and don't forget to decrement it when you're done.
  • #29 Here's an fop method for writev(). Do all your work before passing the I/O on to the next translator in the chain, i.e. before calling STACK_WIND. Note the cbk, in this case uidmap_writev_cbk(). No I/O occurs here; it all happens sometime between the STACK_WIND and the invocation of the cbk. N.B. in this case we're only passing the I/O to a single child. In the danger zone? Only do clean-up, e.g. release local state, and make sure it's not in use further down the call tree.
  • #30 Here's the cbk. The I/O is complete at this point; return back up the stack with STACK_UNWIND. Notice the cookie parameter; more about this on the next slide.
  • #31 Compare STACK_WIND to STACK_WIND_COOKIE. Notice 'inode' and that it's being passed as an extra param. The 'cookie' is only shared between the fop and its cbk. There's also frame->local, which is shared by all fops and cbks, but be careful you don't overwrite something that's already there.
  • #32 Here we see that cookie != NULL when the fop used STACK_WIND_COOKIE.
  • #33 This slide speaks for itself. Is there a reason to use STACK_UNWIND? Ever?
  • #34 Here's an example from AFR where the I/O is sent to all the replicas. Remember the danger zone.
  • #35 Not much else to say.
  • #36 Note that this is in the cbk. Xlators further down the call tree may return an error in op_ret and op_errno, e.g. the storage/posix xlator that actually writes to disk may hit a real error, and this propagates back up. You could decide it's not a real error; conversely, the lower xlators may return success and you could decide it is an error anyway. Either pass the parameters on, or change them accordingly. Recommendation: use an appropriate errno, or you risk confusing people.
  • #37 And here's how to handle some unrecoverable error, logic or otherwise, in the fop. Note the glusterfs idiom (retch).
  • #38 A basic translator isn't hard. Trust me. ;-)