This document provides an overview of IBM Spectrum Scale Active File Management (AFM). AFM allows data to be accessed globally across multiple clusters as if it were local, by automatically managing asynchronous replication. It describes the four AFM modes (read-only, local-update, single-writer, and independent-writer) and covers topics such as pre-fetching data, cache eviction, cache states, expiration of stale data, and the types of data transferred between the home and cache sites.
IBM Spectrum Scale Fundamentals Workshop for Americas, Part 4: Replication, Stretched Cluster and Active File Management
1. Spectrum Scale 4.1 System Administration
Spectrum Scale Active File Management (AFM): bringing data together across clusters
2. Unit objectives
After completing this unit, you should be able to:
• Describe the value of Active File Management (AFM)
• Describe the home and cache relationship and its features
• Understand some client-leveraged use cases
• List the various AFM modes and relationships
• Create and manage an AFM relationship
4. Evolution of the global namespace: AFM
• Spectrum Scale (GPFS, 1993) introduced concurrent file system access from multiple nodes.
• Multi-cluster (2005) expanded the global namespace by connecting multiple sites.
• Active File Management (AFM, 2011) takes the global namespace truly global by automatically managing asynchronous replication of data.
(The slide's timeline diagram shows GPFS clusters at each of the three stages: 1993, 2005, and 2011.)
5. IBM Active File Management (AFM): unique 21st-century advanced global functionality
• The IBM Spectrum Scale central site can be a source where data is created, maintained, and updated/changed.
• The central site can push data to edge sites for WAN optimization.
• Remote sites can periodically pre-fetch data (via policy) or pull it on demand.
• Data is revalidated when accessed (staleness check).
• Remote sites can be primary (write) owners and send data back to the central site.
• The central data site has all the directories; backup/HSM is managed out of this site.
• Local or long-distance users share a dedicated home file system, each with an individual home directory.
By establishing an automated relationship between clusters, AFM provides access to files from anywhere, as if they were local.
(The slide's diagram shows a central data site in a single global namespace with global distribution: branch offices push or pull on demand, can read or write, and hold revalidated copies; the use cases named are central office/branch office with ingest/disseminate, collaboration in the cloud, and backup integration.)
8. Synchronous operations (cache validate/miss)
• On a cache miss, attributes are pulled and the object is created locally "on demand" (lookup, open, …)
– If the cache is set up against an empty home, there should not be any synchronous operations.
• On a later data read:
– The whole file is fetched over NFS and written locally.
– The data read is done in parallel across multiple nodes.
– Applications can continue once the required data is in the cache, while the remainder of the file is being fetched.
• On a cache hit:
– Attributes are revalidated based on the revalidation delay.
– If the data hasn't changed, it is read locally.
• On access from a disconnected node:
– Access to cached data fetches local data only.
– Files not yet cached return a "does not exist" error (ENOENT).
– Files written locally are lazily synced back to the home site on reconnection.
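The cache states described above can be inspected from the cache cluster with `mmafmctl getstate`; a minimal sketch, where the file system name `fs1` and fileset name `cache1` are illustrative placeholders:

```shell
# Show the AFM state of a cache fileset (e.g. Active, Dirty, Disconnected)
# along with its gateway node and queue length.
mmafmctl fs1 getstate -j cache1
```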
9. Asynchronous updates (write, create, remove)
• Updates at the cache site are pushed back lazily (asynchronously)
– This masks the latency of the WAN
• Data is written to Spectrum Scale at the cache site synchronously
• Writeback is asynchronous
– A configurable async delay is provided
• Writeback coalesces updates and accommodates out-of-order and parallel writes
– I/O is filtered as needed (e.g., rewrites to the same blocks)
• The admin can force a sync if needed:
mmafmctl --flushPending
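A hedged sketch of forcing the write-back queue to drain, using the fuller `mmafmctl Device flushPending` spelling from the command reference (the file system name `fs1` and fileset name `cache1` are illustrative):

```shell
# Push all queued asynchronous updates from the cache fileset to home now,
# instead of waiting for the configured async delay.
mmafmctl fs1 flushPending -j cache1

# The async delay itself is a tunable fileset attribute (seconds):
mmchfileset fs1 cache1 -p afmAsyncDelay=60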
10. Active File Management (AFM) Cache
A cache is a property of a fileset and is defined when you create the fileset. The other side of the relationship is called the Home site or Target (same thing, two names).
AFM Cache Facts
• There is one home relationship per cache fileset.
• The relationship between a cache and a home is one-to-one: all a cache knows about is its home. A home does not even know a cache exists.
• The cache does all the work: the cache checks the home for changes and sends updates to the home. How a cache behaves is determined by the cache mode.
• There are four cache modes: Read-Only (ro), Local-Update (lu), Single-Writer (sw), and Independent-Writer (iw).
Calling this a "cache" may be selling it a little short. Inode and file data in a cache fileset are the same as inode and file data in any Spectrum Scale file system. It is a "real" file stored on disk; the job of the cache is to keep the data in the file consistent, at some level, with the data on the other side of the relationship.
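Because a cache is a property of a fileset, it is declared at fileset-creation time. A sketch, assuming a home cluster that exports `/export/home` over NFS (the file system, fileset, and host names are illustrative):

```shell
# Create a read-only AFM cache fileset backed by the home site's NFS export.
mmcrfileset fs1 cache1 --inode-space=new \
    -p afmMode=read-only \
    -p afmTarget=homeserver:/export/home

# Link it into the namespace like any other fileset.
mmlinkfileset fs1 cache1 -J /gpfs/fs1/cache1
```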
15. Notes on AFM Modes
• Single Writer (sw)
– Only the cache can write data; the home can't change.
– Peer caches need to be set up as read-only.
• Read Only (ro)
– The cache site can only read data; no data change is allowed from the cache.
• Local Update (lu)
– Data is cached from the home, and changes are allowed as in SW mode; however, changes are not pushed back to the home.
– Once data is changed, the relationship for that data is broken, i.e., cache and home are no longer in sync for that file.
• Independent Writer (iw)
– Data can change at the home and at any cache.
– Different caches can change different files.
• Changing modes
– An SW, IW, or RO mode cache can be changed to any other mode.
– An LU cache can't be changed (because it assumes data will be different).
17. Network Infrastructure
• AFM uses the following network services on specific ports
– Check and make sure that the network infrastructure and firewalls have the following ports open between the clusters:
– 1081 for HTTP, 22 for SSH, 2049 for NFS, 32767 for NFS mount.
Tips:
– In any network there can be man-in-the-middle firewalls, blocked ports, and/or port mapping within the infrastructure.
– Watch out for situations where port mapping could change if the port goes idle for a period of time, or where a firewall may close an idle port.
– Plan for and allow time to research, coordinate, find, and resolve any of these types of networking issues.
– Other than these requirements, AFM runs on a standard network infrastructure that supports NFSv3.
– Allow time for network admins to apply standard TCP/IP tuning expertise, such as setting window sizes and tuning network buffers.
• Confirm that ssh logon to the remote sites is acceptable
– AFM requires ssh logon to remote sites; AFM cannot be used if ssh is not acceptable.
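The listed ports can be spot-checked from a cache node before configuring AFM; a minimal sketch using standard tools (the home-site hostname is a placeholder):

```shell
# Verify that the home site is reachable on the ports AFM uses.
for port in 22 1081 2049 32767; do
    if nc -z -w 5 home-server.example.com "$port"; then
        echo "port $port open"
    else
        echo "port $port blocked"
    fi
done

# AFM also requires non-interactive ssh logon to the home cluster:
ssh -o BatchMode=yes root@home-server.example.com true && echo "ssh OK"
```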
21. Pre-fetching (data is proactively populated)
• Prefetch files selectively from home to cache
• Runs asynchronously in the background
• Parallel multi-node prefetch (new in 4.1)
• Metadata-only prefetch, without fetching file data (new in 4.1)
• A user exit is invoked on completion
• You can choose the files to prefetch based on policy
For example:
Make a file list using a simple LIST rule via policy if the home is GPFS, or using find, ls -lR, or a similar tool, and feed this file list to mmafmctl --prefetch --ns; this populates the directory tree in the fileset. The administrator can then migrate either selected files or all files using mmafmctl --prefetch --filelist.
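The policy-driven approach above can be sketched as follows. The rule name, paths, file pattern, and fileset names are illustrative, and the prefetch flag spelling varies between release levels (the slide's `--filelist` appears as `--list-file` in later command references):

```shell
# 1. On the home cluster, build a list of candidate files with a policy LIST rule.
cat > /tmp/prefetch.pol <<'EOF'
RULE 'toCache' LIST 'prefetchList' WHERE UPPER(NAME) LIKE '%.DAT'
EOF
mmapplypolicy /gpfs/homefs -P /tmp/prefetch.pol -f /tmp -I defer

# 2. On the cache cluster, feed the resulting list to the prefetch operation,
#    which runs asynchronously across the gateway nodes.
mmafmctl fs1 prefetch -j cache1 --list-file /tmp/list.prefetchList
```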
23. Cache Eviction (data in the cache is expired/removed)
• Use when:
– The cache is smaller than the home.
– Data fills up the cache faster than it can be pushed to home.
– You need to create space for caching other files, or space for incoming writes.
– Eviction is linked with fileset quotas.
• For an RO fileset, cache eviction is triggered automatically
– when fileset usage goes above the fileset soft quota limit;
– files are chosen based on LRU;
– files with unsynchronized data are not evicted.
• Eviction can be disabled.
• It can be triggered manually:
mmafmctl Device evict -j FilesetName
afmEnableAutoEviction
This AFM configuration attribute enables eviction on a given fileset. A value of yes specifies that eviction is allowed on the fileset; a value of no specifies that eviction is not allowed.
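Since automatic eviction is driven by the fileset soft quota, wiring it up might look like the following sketch (the quota limits, file system, and fileset names are examples, and setting `afmEnableAutoEviction` via `mmchconfig` is an assumption of this sketch):

```shell
# Allow automatic eviction, then set a soft quota that triggers it.
mmchconfig afmEnableAutoEviction=yes -i
mmsetquota fs1:cache1 --block 800G:1T   # LRU eviction starts above the 800G soft limit

# Eviction can also be triggered manually at any time:
mmafmctl fs1 evict -j cache1
```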
27. Expiration of Data (preventing access to stale data)
• Staleness control
– Defined based on the time since disconnection
– Once a cache is expired, no access to the cache is allowed
– Manual expire/unexpire option for the admin:
mmafmctl --expire/unexpire
– Allowed only for ro-mode caches
This prevents access to stale data, where staleness is defined by the amount of time that the WAN cache has been out of synchronization with the data at the home site.
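A sketch of expiring and un-expiring a read-only cache manually, plus the disconnection timeout that drives automatic expiration (all names are illustrative):

```shell
# Mark a read-only cache as expired: no access until it is unexpired.
mmafmctl fs1 expire -j roCache1

# Make the cache accessible again.
mmafmctl fs1 unexpire -j roCache1

# Automatic expiration: expire after 600 seconds of disconnection from home.
mmchfileset fs1 roCache1 -p afmExpirationTimeout=600
```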
32. Independent Writer
• Multiple cache filesets can write to a single home, as long as each cache writes to different files.
• The multiple cache sites revalidate periodically and pull new data from home.
• If multiple cache filesets write to the same file, the sequence of updates is non-deterministic.
• Writes are pushed to home as they come in, independently, because there is no locking between clusters.
• Use case: unique users at each site updating files in their own home directories.
33. New in Spectrum Scale 4.1
• Spectrum Scale backend using Spectrum Scale multi-cluster
• Parallel I/O
– Using multiple threads and multiple nodes per file
• Better handling of gateway (GW) node failures
• Various usability improvements
34. Some Restrictions to Consider
• Hard links
– Hard links at home are not detected.
– Hard links created in the cache are maintained.
• The following are NOT supported/cached/replicated:
– Clones
– Special files such as sockets or device files
– Fileset metadata such as quotas, replication parameters, snapshots, etc.
• Renames
– Renames at home result in a remove/create in the cache.
• Locking is restricted to the cache cluster only.
• Independent filesets only (no file-system-level AFM setup)
– Dependent filesets can't be linked into AFM filesets.
• Peer snapshots are supported only in SW mode.
36. AFM-based DR – 4.1++
(The slide's diagram shows a NAS client that switches to the secondary on failure, with AFM configured as primary pushing all updates asynchronously to AFM configured as secondary.)
Supported in TL2:
• Replicate data from the primary to the secondary site.
• The relationship is active-passive (primary: RW, secondary: RO).
• The primary can operate actively with no interruption when the relationship with the secondary fails.
• Automatic failback when the primary comes back.
• Granularity at the fileset level.
• RPO and RTO support, from minutes to hours (depends on data change rate, link bandwidth, etc.).
Not supported in TL2:
• No cascading mode, i.e., no tertiary site; only one secondary is allowed per relationship.
• POSIX-only operations; no appendOnly support.
• No file-system-level support.
• The present limitation of not allowing a dependent fileset to be linked inside an AFM (Panache) fileset remains.
• No metadata replication (dependent filesets, user snapshots, fileset quotas, user quotas, replication factor, other fileset attributes, direct I/O setting).
37. Psnap: Consistent Replication
(The slide's diagram shows continuous replication with snapshot support between AFM configured as home-master and AFM configured as home-replica, coordinated by a multi-site snapshot management tool; all updates are pushed asynchronously, and the snapshots at cache and home correspond to the same point in time.)
1. Take a fileset snapshot at the master; mark the point in time in the write-back queue.
2. Push all updates up to the point-in-time marker.
3. Take a snapshot of the fileset at the replica; update the management tool's state with the last snapshot time and IDs.
38. Basics of DR Configuration
• Establish the primary-secondary relationship
– Create an AFM fileset at the primary and associate it with the DR secondary fileset.
– This provides the DRPrimaryID, which should be used when setting up the DR secondary.
• Initialization phase
– Truck the data from the primary to the secondary if necessary.
– Initial trucking can be done via AFM or out of band (by a customer-chosen method, such as tape).
• Normal operation
– Async replication continuously pushes data to the secondary based on asyncDelay.
– Psnap support provides common consistency points between primary and secondary.
– Psnaps are taken periodically based on the RPO.
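The configuration steps above can be sketched roughly as follows; all file system, fileset, and host names are illustrative, the exact attribute and subcommand names follow later 4.1.x command references and may differ at this release level, and the elided ID must be copied from step 1's output:

```shell
# 1. Create the primary fileset at the primary site; note the primary ID it reports.
mmcrfileset fs1 drPrim --inode-space=new \
    -p afmMode=primary \
    -p afmTarget=nfs://secondary-site/export/drSec
mmafmctl fs1 getPrimaryId -j drPrim

# 2. At the secondary site, create the secondary fileset bound to that ID.
mmcrfileset fs2 drSec --inode-space=new \
    -p afmMode=secondary \
    -p afmPrimaryId=<ID-from-step-1>

# 3. Set the RPO interval (minutes) that drives the periodic psnaps.
mmchfileset fs1 drPrim -p afmRPO=30
```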
39. On a DR Event (here is what happens)
• Primary failure
– Promote the secondary to DR primary.
– Restore data from the last consistency point (RPO snapshot).
• Secondary failure
– Establish a new secondary:
mmafmctl --setNewSecondary
– This takes an initial snapshot and pushes data to the new secondary in the background.
– RPO snapshots start after the initial sync.
• Failback to the old primary
– Restore to the last RPO snapshot (similar to what is done on the secondary during its promotion to primary).
– Find changes made at the secondary and apply them back to the original primary, incrementally or once.
– Needs downtime during the last iteration to avoid any more changes.
– Revert the primary/secondary modes.
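The failure handling above maps to mmafmctl subcommands roughly as follows. This is a hedged sketch: the subcommand names follow later 4.1.x documentation and may differ from the slide's spelling, and all file system, fileset, and target names are illustrative:

```shell
# Primary failure: promote the secondary fileset to act as primary.
mmafmctl fs2 convertToPrimary -j drSec

# Secondary failure: point the surviving primary at a new secondary;
# an initial snapshot is pushed to it in the background.
mmafmctl fs1 changeSecondary -j drPrim \
    --new-target nfs://new-secondary/export/drSec2

# Failback: once the old primary is restored and resynced, resume normal roles.
mmafmctl fs1 failbackToPrimary -j drPrim
```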