Rook-Deployed Scalable
NFS Clusters Exporting
CephFS
Jeff Layton
Principal Software Engineer
Patrick Donnelly
Senior Software Engineer
CephFS Team Lead
1
What is Ceph(FS)?
2
● CephFS is a POSIX distributed file
system.
● Clients and MDS cooperatively
maintain a distributed cache of
metadata including inodes and
directories
● MDS hands out capabilities (aka
caps) to clients, to allow them
delegated access to parts of inode
metadata
● Clients directly perform file I/O on
RADOS
[Architecture diagram: clients exchange metadata with the active MDS daemons (each active MDS has a journal; a standby MDS waits to take over); the MDSs flush their journals and metadata to the RADOS metadata pool, while clients read and write file data directly to the RADOS data pool.]
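For concreteness, a CephFS file system can be mounted directly with either the kernel client or ceph-fuse. A minimal sketch, assuming a monitor at mon1.example.com and an admin keyring; hostnames, credentials, and paths are placeholders:

$ sudo mount -t ceph mon1.example.com:6789:/ /mnt/cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret   # kernel client
$ sudo ceph-fuse --id admin /mnt/cephfs                  # FUSE client via libcephfs (reads /etc/ceph/ceph.conf)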
Ceph, an integral component of hybrid clouds
3
Manila:
● A file share service for VMs
● Uses CephFS to provision shared volumes.
Cinder:
● A block device provisioner for VMs.
● Uses Ceph’s RADOS Block Device (RBD).
CephFS usage in community OpenStack
4
● Most OpenStack users are already running a Ceph cluster
● Open source storage solution
● CephFS metadata scalability is ideally
suited to cloud environments.
https://www.openstack.org/user-survey/survey-2017
So why Kubernetes?
5
Lightweight Containers!
● Trivial to spin up services in response to changing
application needs. Extensible service infrastructure!
Parallelism
● Containers are lightweight enough for lazy and
optimal parallelism!
Fast/Cheap Failover
● Service failover only requires a new pod.
Fast IP Failover/Management
● Filesystem shared between multiple nodes
● ...but also tenant-aware
Why Would You Need a
Ceph/NFS Gateway?
6
Clients that can’t speak Ceph properly
● old, questionable, or unknown Ceph drivers (old kernels)
● third-party OSes
Security
● Partition Ceph cluster from less trusted
clients
● GSSAPI (kerberos)
OpenStack Manila
● Filesystem shared between multiple nodes
● ...but also tenant-aware
● ...and self-managed by tenant admins
[Diagram: Ceph cluster (RADOS OSDs, MDS) with ceph-fuse and kernel CephFS clients.]
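This is the gateway's value in a nutshell: a client with no usable Ceph driver only needs a stock NFS client. A hypothetical mount against such a gateway, with placeholder address and export path:

$ sudo mount -t nfs -o vers=4.1,proto=tcp nfs-gw.example.com:/cephfs /mnt/share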
Active/Passive Deployments
7
● One active server at a time that “floats”
between physical hosts
● Traditional “failover” NFS server,
running under pacemaker/corosync
● Scales poorly + requires idle resources
● Available since Ceph Luminous (~Aug
2017)
Goal: Active/Active Deployment
8
[Diagram: multiple active NFS servers serving a set of NFS clients.]
Goals and Requirements
9
Scale Out
Active/Active cluster of mostly
independent servers. Keep
coordination between them to a
bare minimum.
Containerizable
Leverage container orchestration
technologies to simplify
deployment and handle
networking. No failover of
resources. Just rebuild containers
from scratch when they fail.
NFSv4.1+
Avoid legacy NFS protocol
versions, allowing us to rely on new
protocol features for better
performance, and possibility for
pNFS later.
Ceph/RADOS for
Communication
Avoid the need for any third-party clustering or communication between cluster nodes. Use Ceph and RADOS to coordinate.
Ganesha NFS Server
10
● Open-source NFS server that runs in userspace (LGPLv3)
● Plugin interface for exports and client recovery databases
○ well suited for exporting userland filesystems
● FSAL_CEPH uses libcephfs to interact with Ceph cluster
● Can use librados to store client recovery records and
configuration files in RADOS objects
● Amenable to containerization
○ Store configuration and recovery info in RADOS
○ No need for writeable local fs storage
○ Can run in unprivileged container
○ Rebuild server from r/o image if it fails
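As an illustration, a minimal Ganesha configuration along these lines might look roughly as follows. This is a sketch rather than the exact file Rook generates; block and parameter names should be checked against the NFS-Ganesha documentation, and the pool, namespace, node id, and export values are placeholders:

$ cat > /etc/ganesha/ganesha.conf <<'EOF'
NFSv4 {
    RecoveryBackend = rados_cluster;  # keep client recovery records in RADOS
    Minor_Versions = 1, 2;            # NFSv4.1+ only
}
RADOS_KV {
    Pool = nfs-ganesha;               # RADOS pool holding recovery state
    Namespace = mycluster;
    NodeId = ganesha-a;               # unique per server in the cluster
}
EXPORT {
    Export_Id = 100;
    Path = /;                         # CephFS path to export
    Pseudo = /cephfs;                 # NFSv4 pseudo-root path
    Access_Type = RW;
    Squash = None;
    FSAL {
        Name = CEPH;                  # FSAL_CEPH -> libcephfs
    }
}
EOF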
NFS Protocol
11
● Based on ONC-RPC (aka SunRPC)
● Early versions (NFSv2 and v3) were stateless
○ with sidecar protocols to handle file locking (NLM and NSM)
● NFSv4 was designed to be stateful, and state is leased to the client
○ client must contact server at least once every lease period to renew (45-60s is typical)
● NFSv4.1 revamped protocol using a sessions layer to provide exactly-once semantics
○ added RECLAIM_COMPLETE operation (allows lifting grace period early)
○ more clustering and migration support
● NFSv4.2 mostly new features on top of v4.1
NFS Client Recovery
12
● After restart, NFS servers come up with no ephemeral state
○ Opens, locks, delegations and layouts are lost
○ Allow clients to reclaim state they held earlier for ~2 lease periods
○ Detailed state tracking on stable storage is quite expensive
○ Ganesha had support for storing this in RADOS for single-server configurations
● During the grace period:
○ No new state can be established. Clients may only reclaim old state.
○ Allow reclaim only from clients present at time of crash
■ Necessary to handle certain cases involving network partitions
● Must keep stable-storage records of which clients are allowed to reclaim after a reboot
○ Prior to a client doing its first OPEN, set a stable-storage record for client if there isn’t one
○ Remove after last file is closed, or when client state expires
○ Atomically replace old client db with new just prior to ending grace period
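Because the recovery database lives in RADOS, it can be inspected with the stock rados tool. A sketch with placeholder pool, namespace, and object names (the per-epoch, per-node object naming used by the rados_cluster backend may differ in detail):

$ rados -p nfs-ganesha --namespace mycluster ls                  # list recovery objects
$ rados -p nfs-ganesha --namespace mycluster listomapkeys \
      rec-0000000000000002:ganesha-a                             # client records for one epoch/node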
NFS Server Reboot Epochs
13
[Timeline: Epochs 1, 2, and 3; each epoch starts with a Grace Period followed by Normal Ops.]
● Consider each reboot the start of a new epoch
○ As clients perform first open (reclaim or regular), set a record for them
in a database associated with current epoch
○ During grace period, allow clients that are present in the previous
epoch’s database to reclaim state
○ After transition from Grace to Normal operations, can delete any
previous epoch’s database
● The same applies in a cluster of servers!
New in Ganesha: Coordinating Grace Periods
14
● Done via central database stored in a RADOS object
● Two 64-bit integers:
○ Current Epoch (C): Where should new records be stored?
○ Recovery Epoch (R): From what epoch should clients be allowed to reclaim?
○ R == 0 means grace period is not in force (normal operation)
● Flags for each node indicating current need for a grace period and enforcement status
○ NEED (N): does this server currently require a grace period?
■ First node to set its NEED flag declares a new grace period
■ Last node to clear its NEED flag can lift the grace period
○ ENFORCING (E): is this node enforcing the grace period?
■ If all servers are enforcing then we know no conflicting state can be handed out
(grace period is fully in force)
Simple logic allows individual hosts to make decisions about grace enforcement based on the state of all servers in the cluster. No centralized running entity is required.
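NFS-Ganesha ships a small utility, ganesha-rados-grace, for inspecting and manipulating this shared grace database. A sketch only; pool, namespace, and node ids are placeholders, and the exact options should be checked against the tool's man page:

$ ganesha-rados-grace --pool nfs-ganesha --ns mycluster dump           # show C, R, and per-node N/E flags
$ ganesha-rados-grace --pool nfs-ganesha --ns mycluster add ganesha-c  # register another cluster node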
Challenge: Layering NFS over a Cluster FS
15
● Both protocols are stateful and use a lease-based mechanism. Ganesha acts as a proxy.
● Example: a Ganesha server may crash and a new one be started:
○ The client issues a request to reclaim an NFS open
○ Ganesha performs an open request via libcephfs
○ The open blocks, waiting for caps held by the previous Ganesha instance
○ The default MDS session timeout is 60s
○ The old session could also die before the other servers are enforcing the grace period
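The 60s figure above is the default MDS session timeout. On recent Ceph releases it is a per-file-system setting; assuming a file system named cephfs (and that the setting name matches your release), it can be inspected and tuned with:

$ ceph fs get cephfs                       # dumps MDS map settings, including session_timeout
$ ceph fs set cephfs session_timeout 60    # seconds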
New In Nautilus: Client Reclaim of Sessions
16
● When a CephFS client dies abruptly, the MDS
keeps its session around for a little while
(several minutes). Stateful info (locks, opens,
caps, etc.) are tied to those sessions.
● Added new interfaces to libcephfs to allow
ganesha to request that its old session be
killed off immediately.
● May eventually add ability for session to take
over old stateful objects, obviating need for
grace period on surviving server heads.
https://tracker.ceph.com/issues/36397
Challenge: Deploy and Manage NFS Clusters
17
● Dynamic NFS cluster creation/deletion
○ Scale in two directions:
■ active-active configurations of NFS clusters for hot subvolumes
■ per-subvolume independent NFS clusters
● Handling failures + IP migration
○ Spawn replacement NFS-Ganesha daemon and migrate the IP address.
○ Note: NFS clients cannot modify the NFS server’s IP address post-mount
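One way a Kubernetes deployment can sidestep this is to put a stable Service IP in front of each Ganesha pod, so a replacement pod keeps the address clients originally mounted. A rough sketch only; the names, labels, and namespace are placeholders rather than the exact objects Rook creates:

$ kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: nfs-ganesha-a
  namespace: rook-ceph
spec:
  selector:
    app: rook-ceph-nfs     # label assumed to be on the Ganesha pod
  ports:
  - name: nfs
    port: 2049
    protocol: TCP
EOF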
Tying it all together with Rook and
Kubernetes
18
● Rook (rook.io) is a cloud-native storage orchestrator that can deploy Ceph clusters dynamically.
○ Ceph is aware of the deployment technology!
● Rook operator handles cookie-cutter
configuration of daemons
○ CephNFS resource type in k8s
● Integration with orchestrator ceph-mgr module
○ Deploy new clusters
○ Scale node count up or down
○ Dashboard can add+remove exports
● Can scale NFS cluster size up or down (MDS too
in the future!)
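A sketch of what the CephNFS resource might look like; the field layout follows the Rook examples of that era, and the names, pool, namespace, and count below are placeholders to adapt:

$ kubectl apply -f - <<'EOF'
apiVersion: ceph.rook.io/v1
kind: CephNFS
metadata:
  name: mynfs
  namespace: rook-ceph
spec:
  rados:
    pool: nfs-ganesha     # RADOS pool for Ganesha config + recovery state
    namespace: mynfs
  server:
    active: 3             # number of active Ganesha servers; edit to scale up/down
EOF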
Ceph Nautilus: Mechanisms in Place
19
● Volume management is integrated into Ceph to provide a single source of truth (SSOT) for CephFS exports.
○ Volumes provide an abstraction for CephFS file systems.
○ Subvolumes provide an abstraction for independent CephFS directory trees.
○ Manila/Rook/<X> all use the same interface.
● Ceph/Rook integration in place to launch NFS-Ganesha clusters to export
subvolumes.
● Ceph management of exports and configs for NFS-Ganesha clusters.
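In practice, the volumes interface is a handful of ceph commands; the volume and subvolume names below are placeholders:

$ ceph fs volume create vol_a                  # create a CephFS file system ("volume")
$ ceph fs subvolume create vol_a subvol_1      # carve out an independent directory tree
$ ceph fs subvolume getpath vol_a subvol_1     # path to hand to an NFS export or CSI mount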
Vision for the Future
20
● Trivial creation/config of a managed volume with a persistent NFS Gateway.
● NFS-Ganesha:
○ Add support for NFSv4 migration to allow redistribution of clients among
servers
○ Optimizing grace periods for subvolumes.
● SMB integration (samba)
$ ceph fs subvolume create vol_a subvol_1
$ ceph fs subvolume set vol_a subvol_1 <nfs-config…>
$ ceph fs subvolume set vol_a subvol_1 sharenfs true
14.2.2
14.2.3+
21
Jeff Layton jlayton@redhat.com
Patrick Donnelly pdonnell@redhat.com
Blog Demo: https://ceph.com/community/deploying-a-cephnfs-server-cluster-with-rook/
New in Nautilus: CephFS Improvements: https://ceph.com/community/nautilus-cephfs/
Rook: https://rook.io/
Ceph: https://ceph.com/
NFS Ganesha: https://www.nfs-ganesha.org/
Thank you
22
Appendix
CephFS and Client Access
23
[Diagram: ceph-fuse and kernel clients and the MDS talking to Ceph RADOS (OSDs); the MDS keeps its journal and inodes in RADOS.]
● CephFS is a POSIX distributed file
system.
● Clients and MDS cooperatively
maintain a distributed cache of
metadata including inodes and
directories
● MDS hands out capabilities (aka
caps) to clients, to allow them
delegated access to parts of inode
metadata
● Clients directly perform file I/O on
RADOS
24
[Diagram: Manila defines a share (CephFS name, export paths, network share such as a Neutron ID+CIDR, and a share server count); the Ganesha NFS gateway runs in a Kubernetes pod whose HA is managed by Kubernetes, alongside the Ceph daemons (MDS, OSDs, MGR); Rook/Kubernetes plus Kuryr act as the network driver, and scale-out and shares are managed by the mgr.]
CephFS
25
● Metadata is handled via a Metadata Server (MDS)
○ clients establish a session with the MDS
● CephFS MDS hands out capabilities (aka caps) to clients
○ recallable, delegated, granular parts of inode metadata:
■ AUTH - uid, gid, mode
■ LINK - link count
■ FILE - read/write ability, layout, file size, mtime
■ XATTR - xattrs
○ most come in shared/exclusive flavors
○ grant the clients the ability to do certain operations on an inode
○ tied to the client’s MDS session
● libcephfs provides a (somewhat POSIX-ish) userland client interface
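For a concrete look at sessions and the caps they hold, the MDS admin socket can be queried on the MDS host; the daemon name mds.a is a placeholder:

$ ceph daemon mds.a session ls    # lists client sessions, including the number of caps each holds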
26
EOS
Editor's Notes
  1. Rook was developed as a storage provider for Kubernetes to automatically deploy and attach storage to pods. Significant effort within Rook has been devoted to integrating the open-source storage platform Ceph with Kubernetes. Ceph is a distributed storage system in broad use today that presents unified file, block, and object interfaces to applications. This talk will present completed work in the Ceph Nautilus release to dynamically create highly-available and scalable NFS server clusters that export the Ceph file system (CephFS) for use within Kubernetes or as a standalone appliance. CephFS provides applications with a friendly programmatic interface for creating shareable volumes. For each volume, Ceph and Rook cooperatively manage the details of dynamically deploying a cluster of NFS-Ganesha pods with minimal operator or user involvement. Tuesday June 25, 2019 14:20 - 14:55 609 KC+CNC - Storage https://www.lfasiallc.com/events/kubecon-cloudnativecon-china-2019/schedule-english/
  2. Early efforts to create an HPC file system giving nodes unencumbered access to the data path.
  3. Origins of this effort are in OpenStack and Manila.
  4. So how does Kubernetes fit in? That will become clear towards the end!
  5. Not necessarily missing from OpenStack...
  6. Those exact goals run counter to desired I/O access in some scenarios.
  7. Early efforts to create an HPC file system giving nodes unencumbered access to the data path.