AFM Configuration in GPFS or Spectrum Scale.pptx


About This Presentation

AFM Configuration


Slide Content

Active File Management (AFM): Bringing data together across clusters

Unit objectives
After completing this unit, you should be able to:
- Describe home and cache features
- List the various AFM modes
- Create and manage an AFM relationship

Overview

Evolution of the global namespace
- 1993: GPFS introduced concurrent file system access from multiple nodes.
- 2005: Multi-cluster expanded the global namespace by connecting multiple sites.
- 2011: Active File Management (AFM) takes the global namespace truly global by automatically managing asynchronous replication of data.

AFM basics
- Data updates are asynchronous: writes can continue when the WAN is unavailable.
- Two sides to a cache relationship:
  - Home: where the information lives.
  - Cache: data is copied to the cache when requested; data written to the cache is copied back to home as quickly as possible.
- Multiple cache relationships per file system: cache relationships are defined at the fileset level, so a file system can contain multiple homes, caches, and non-cached data.
- Multiple caching modes to meet your needs: Read-Only, Single Writer, Cache-Wins, High Availability.

Synchronous operations (cache validate/miss)
- On a cache miss, attributes are pulled and objects created on demand (lookup, open, ...). If the cache is set up against an empty home, there should be no synchronous operations.
- On a later data read, the whole file is fetched over NFS and written locally. The read is done in parallel across multiple nodes, and the application can continue once the required data is in cache while the rest of the file is still being fetched.
- On a cache hit, attributes are revalidated based on the revalidation delay; if the data hasn't changed, it is read locally.
- On disconnected-mode access, cached data is served locally; files that are not cached are returned as not existing (ENOENT).

Asynchronous updates (write, create, remove)
- Updates at the cache site are pushed back lazily to mask the latency of the WAN.
- Data is written to GPFS at the cache site synchronously; the writeback to home is asynchronous, with a configurable async delay.
- Writeback coalesces updates and accommodates out-of-order and parallel writes, filtering I/O as needed (for example, rewrites to the same blocks).
- An admin can force a sync if needed: mmafmctl Device flushPending (see the example below).
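
A minimal sketch of the two admin knobs mentioned above, based on the mmafmctl and mmchfileset usage shown later in this deck; the file system name fs1 and fileset name cache1 are hypothetical:
# Push all queued (pending) updates for cache fileset cache1 on file system fs1 to home now
mmafmctl fs1 flushPending -j cache1
# Raise the write-back delay for that fileset to 120 seconds (afmAsyncDelay is a per-fileset AFM attribute)
mmchfileset fs1 cache1 -p afmAsyncDelay=120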

AFM mode: Read-only caching
- Read caching mode: data exists on the home fileset and one or more cache sites, and is moved to the cache on demand.
- File metadata caching: listing the contents of a directory moves the file metadata into the cache; opening a file copies the data into the cache.
- Getting data to the cache: on demand when opened, pre-fetch using a GPFS policy, or pre-fetch using a list of files.
- Caching behavior: many caches to one home, optional LRU cleaning of the cache, cascading caches.

AFM mode: Single-writer
- Data is written to a cache and asynchronously replicated back to home.
- Can have multiple read-only caches alongside the single writer.

AFM mode: Independent Writer ("cache wins")
- Multiple cache sites; all of them can write data.
- Conflict resolution default: the last writer wins.

AFM mode: Asynchronous replication (TL1)
- Asynchronous replication in an HA pair: the cache site does the writing, and the home site is the failover target.
- If the cache fails, a new cache can be defined; if home fails, a new home can be defined.

AFM modes
- Single Writer (SW): only the cache can write data; home can't change. Peer caches need to be set up as read-only.
- Read Only (RO): the cache can only read data; no data change is allowed.
- Local Update (LU): data is cached from home and changes are allowed as in SW mode, but changes are not pushed to home. Once data is changed the relationship is broken, i.e., cache and home are no longer in sync for that file.
- Independent Writer (IW): data can change at home and at any cache; different caches can change different files.
- Changing modes: an SW, IW, or RO cache can be changed to any other mode; an LU cache can't be changed. (See the mode-selection example below.)
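
A hedged sketch of how the mode is chosen at cache-creation time, following the mmcrfileset usage shown later in this deck; the device, fileset, and target names are hypothetical, and ro/sw are assumed here as the short forms of the Read Only and Single Writer modes:
# Read-only cache fileset backed by an NFS export on the home cluster
mmcrfileset fs1 ro_cache -p afmTarget=homenode:/gpfs/homefs/data -p afmMode=ro --inode-space=new
# Single-writer cache fileset for a different home path
mmcrfileset fs1 sw_cache -p afmTarget=homenode:/gpfs/homefs/projects -p afmMode=sw --inode-space=new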

Communication between AFM clusters
- Communication is done using NFSv3 (already tested with NFSv4); the architecture is designed to support future protocols.
- GPFS has its own NFSv3 client, with automatic recovery in case of a communication failure, parallel data transfers (even for a single file), and transfer of extended attributes and ACLs.
- Additional benefits: a standard protocol can leverage standard WAN accelerators, and any NFSv3 server can be a "home".

Global namespace (diagram)

Non-GPFS in the global namespace (diagram)

Cache Operations

Pre-fetching
- Prefetch files selectively from home to cache; it runs asynchronously in the background.
- Parallel multi-node prefetch (4.1).
- Metadata-only prefetch without fetching file data (4.1).
- User exit invoked when the prefetch completes.
- Choose the files to prefetch based on policy (see the example below).
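
A hedged example of a list-driven prefetch, following the mmafmctl prefetch usage shown later in this deck; the file system, fileset, and list-file names are hypothetical:
# Prefetch the files named in /tmp/hot_files.list into cache fileset cache1 on file system fs1
mmafmctl fs1 prefetch -j cache1 --list-file /tmp/hot_files.list
# Check the fileset and queue state afterwards
mmafmctl fs1 getstate -j cache1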

AFM is on-disk managed data
- Data is managed like a cache but stored on disk in a GPFS file system.
- How long data stays in a cache depends on configuration: cache cleaning can be disabled (afmAllowEviction), and the time cached data is considered good can be set (afmExpirationTimeout).

Cache eviction
- Use when the cache is smaller than home, when data fills the cache faster than it can be pushed to home, or when space is needed for caching other files or for incoming writes.
- Eviction is linked with fileset quotas: for an RO fileset, eviction is triggered automatically when fileset usage goes above the fileset soft quota limit.
- Files are chosen based on LRU; files with unsynced data are not evicted.
- Eviction can be disabled, and it can be triggered manually: mmafmctl Device evict -j FilesetName (see the example below).
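
A hedged example of a manual eviction, following the mmafmctl evict usage shown later in this deck; the names are hypothetical and the safe limit is assumed to be given in bytes:
# Evict least-recently-used cached files from fileset cache1 until usage drops below roughly 1 GB
mmafmctl fs1 evict -j cache1 --safe-limit 1000000000 --order LRU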

Non-POSIX data over NFS
- Attributes with no POSIX equivalent (ACLs, xattrs, sparse files) can be transferred using the AFM protocol over NFS.
- To enable this, run mmafmconfig against the exported fileset root directory at home.
- At the cache site, the first access checks for the existence of this special file and logs a message if it is not accessible.
- User IDs should be consistent across sites, as AFM does no ID mapping.

Expiration of data (staleness control)
- Defined based on time since disconnection; once a cache is expired, no access to the cache is allowed.
- Manual expire/unexpire option for the admin: mmafmctl Device {expire | unexpire} -j FilesetName (see the example below).
- Allowed only for RO-mode caches.
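
A hedged sketch of the manual expire/unexpire controls and the related timeout, based on the mmafmctl and mmchfileset usage shown in this deck; the names and the value are hypothetical:
# Manually mark a read-only cache fileset as expired, then make it usable again
mmafmctl fs1 expire -j ro_cache
mmafmctl fs1 unexpire -j ro_cache
# Let cached data expire automatically 600 (assumed seconds) after a disconnection
mmchfileset fs1 ro_cache -p afmExpirationTimeout=600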

GPFS AFM: Definitions
Node types:
- Application node: writes/reads data based on application requests to the GPFS file system at the cache cluster; can be Linux or AIX.
- Gateway node(s): the node that connects to the home cluster. It reads/writes data from the home cluster to the cache cluster, checks connectivity with the home cluster and changes to disconnected mode on a connection outage, and triggers recovery on failure. Only Linux is supported.
Sites:
- Home cluster: exports a fileset that can be cached.
- Cache cluster: runs Panache (AFM) and "connects" a local fileset with the home fileset.
Transport protocol: NFSv3.

AFM data
- The following are cached/replicated between home and cache: file data, directory structure, ACLs and extended attributes, sparse files.
- A single gateway node serves each fileset, with parallelism on large-file I/O (multiple threads and multiple nodes, 4.1).
- All metadata operations are queued on the same gateway node for that fileset.

GPFS AFM: Disconnected state
- Each gateway (GW) node monitors connectivity to the home cluster(s): a ping thread sends NULL RPCs to each home for each of the active filesets on that GW node.
- A gateway node goes into disconnected mode after the ping times out and informs all gateway nodes of the disconnection for that home; all GW nodes then mark that home as disconnected.
- When the ping thread detects reconnection, each GW node that was disconnected informs the lead GW node, and the lead GW node informs all GW nodes of the reconnection once all GW nodes are reconnected.
- An external trigger is required to move to the reconnected state if some GW nodes never reconnect but others can. (See the getstate example below.)
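
A hedged example of how an admin can observe the connection state described above, using the mmafmctl getstate action from the usage shown later in this deck; fs1 is a hypothetical file system name:
# Show the AFM state (for example Active, Dirty, or Disconnected) of every AFM fileset in fs1
mmafmctl fs1 getstate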

Independent Writer: A Closer Look

Independent Writer
- Multiple cache filesets can write to a single home, as long as each cache writes to different files.
- The multiple cache sites revalidate periodically and pull new data from home.
- If multiple cache filesets write to the same file, the sequence of updates is non-deterministic: the writes are pushed to home independently as they come in, because there is no locking between clusters.

Moving applications between IW cache and home
- An application can be moved from cache to home only if its semantics allow it to work with incomplete data.
- A special procedure needs to be followed for moving an application from cache to home and back to cache, whether the move is planned or unplanned.

Unplanned application movement (the cache site goes down while pending changes are not yet synced)
1. Note down the time at home; this time is required during failback when the application is moved back to cache.
2. Unlink the fileset at the cache.
3. Move the application to home.
4. Once the cache is back, run the failback command to resolve differences, prefetch new data, and open the fileset up for the application.

New in GPFS 4.1
- GPFS backend using GPFS multi-cluster
- Parallel I/O using multiple threads and multiple nodes per file
- Better handling of GW node failures
- Various usability improvements

Restrictions
- Hard links: hard links at home are not detected; hard links created in the cache are maintained.
- The following are NOT supported/cached/replicated: clones; special files such as sockets or device files; fileset metadata such as quotas, replication parameters, snapshots, etc.
- Renames: renames at home result in a remove/create in the cache.
- Locking is restricted to the cache cluster only.
- Independent filesets only (no file-system-level AFM setup); dependent filesets can't be linked into AFM filesets.
- Peer snapshots are supported only in SW mode.

Disaster Recovery: coming in GPFS 4.1.1 (aka TL2)

AFM-based DR (4.1++)
Diagram: NAS client writes to AFM configured as primary, which pushes all updates asynchronously to AFM configured as secondary; the client switches to the secondary on failure.
Supported in TL2:
- Replicate data from the primary to the secondary site; the relationship is active-passive (primary RW, secondary RO).
- The primary continues to operate actively with no interruption when the relationship with the secondary fails.
- Automatic failback when the primary comes back.
- Granularity at the fileset level.
- RPO and RTO support, from minutes to hours (depends on data change rate, link bandwidth, etc.).
Not supported in TL2:
- No cascading mode, i.e., no tertiary, and only one secondary allowed per relationship.
- POSIX-only operations; no appendOnly support.
- No file-system-level support.
- The present limitation of not allowing dependent filesets to be linked inside an AFM (Panache) fileset remains.
- No metadata replication (dependent filesets, user snapshots, fileset quotas, user quotas, replication factor, other fileset attributes, direct I/O setting).

Psnap: consistent replication (continuous replication with snapshot support)
Diagram: AFM configured as home-master and home-replica, with all updates pushed asynchronously and coordinated by a multi-site snapshot management tool (SONAS-SPARK).
1. Take a fileset snapshot at the master and mark the point in time in the write-back queue.
2. Push all updates up to the point-in-time marker.
3. Take a snapshot of the fileset at the replica and update the management tool's state with the last snapshot time and IDs.
The snapshots at cache and home then correspond to the same point in time.

DR configuration
- Establish the primary-secondary relationship: create an AFM fileset at the primary and associate it with the DRSecondary fileset; this provides the DRPrimaryID that should be used when setting up the DRSecondary.
- Initialization phase: truck the data from the primary to the secondary if necessary; the initial trucking can be done via AFM or out of band (a customer-chosen method such as tape, etc.).
- Normal operation: async replication continuously pushes data to the secondary based on asyncDelay; psnap support provides common consistency points between primary and secondary, taken periodically based on the RPO.

On a DR event
- Primary failure: promote the secondary to DRPrimary and restore data from the last consistency point (RPO snapshot).
- Secondary failure: establish a new secondary (mmafmctl setNewSecondary); this takes an initial snapshot and pushes data to the new secondary in the background, and RPO snapshots start after the initial sync.
- Failback to the old primary: restore to the last RPO snapshot (similar to what is done on the secondary during its promotion to primary); find the changes made at the secondary and apply them back to the original primary, incrementally or in one pass; down time is needed in the last iteration to avoid any more changes; then revert the primary/secondary modes.

Set-up and Tuning

Setting up AFM
On the home:
- Create an NFS export.
- Set the home export configuration (mmafmhomeconfig).
On the cache:
- Define one or more gateway nodes.
- Create the cache fileset.
(See the end-to-end sketch below.)
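
A hedged end-to-end sketch of these steps. Node names, paths, and export options are hypothetical; mmchnode --gateway is assumed here as the way to designate a gateway node (it is not shown elsewhere in this deck), and mmafmconfig enable follows the home-side enablement mentioned on the earlier non-POSIX-data slide:
# Home cluster: export the fileset path over NFSv3 (an illustrative /etc/exports line)
#   /gpfs/homefs/data  cachecluster-gw1(rw,no_root_squash,sync)
# Home cluster: enable AFM extended-attribute/ACL support on the export root
mmafmconfig enable /gpfs/homefs/data
# Cache cluster: designate a gateway node
mmchnode --gateway -N gw1
# Cache cluster: create and link the cache fileset
mmcrfileset cachefs data_cache -p afmTarget=homenode:/gpfs/homefs/data -p afmMode=ro --inode-space=new
mmlinkfileset cachefs data_cache -J /gpfs/cachefs/data_cache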

Creating a cache
A cache is defined at the fileset level with the mmcrfileset command.
Usage: mmcrfileset Device FilesetName [--inode-space=new [--inode-limit=MaxNumInodes[:NumInodesToPreallocate]] | --inode-space=ExistingFileset] [-p Attr=Value[,Attr=Value...]...] [-t Comment]
Example: mmcrfileset cache2 master_t1 -p afmTarget=nfsnode:/gpfs/m1/m_t1 -p afmMode=cw --inode-space=new

Controlling AFM
Usage:
mmafmctl Device {resync | cleanup | expire | unexpire} -j FilesetName
or mmafmctl Device {getstate | flushPending | resumeRequeued} [-j FilesetName]
or mmafmctl Device failover -j FilesetName --new-target NewAfmTarget [-s LocalWorkDirectory]
or mmafmctl Device prefetch -j FilesetName [[--inode-file PolicyListFile] | [--list-file ListFile]] [-s LocalWorkDirectory]
or mmafmctl Device evict -j FilesetName [--safe-limit SafeLimit] [--order {LRU | SIZE}] [--log-file LogFile] [--filter Attribute=Value ...]

AFM native GPFS protocol support
- GPFS 4.1 enables native GPFS protocol support in place of NFS when using AFM.
- The native GPFS protocol uses the remote file system mount over multi-cluster as the AFM target, so a multi-cluster setup must exist between home and cache before AFM can use the home cluster's file system mount on the remote cluster.
- AFM works with any file system on the home cluster, but ACLs, extended attributes, and sparse files are only supported when the home file system is GPFS. Note: this is true whether using NFS or GPFS.
- The mmafmconfig command is used to enable native GPFS protocol support.

AFM and parallel I/O
- The gateway server acting as the metadata server is the channel for communication with the home cluster.
- GPFS 4.1 now allows a cache cluster to be set up to perform parallel I/O and leverage all gateway servers in the cluster.
- Multiple NFS servers are required at the home cluster, and each gateway node in the cache cluster gets mapped to a specific NFS server at home, allowing for I/O load distribution.
- One or more gateway nodes can be mapped to an NFS server, but each gateway server can only map to one NFS server; the mapping is configured via the mmafmconfig command.
- If native GPFS protocol support is in place, gateway nodes can be mapped to any other node in the same cache cluster.
- In the absence of a mapping definition, all gateway nodes are used for the I/O.

AFM parallel I/O example
This configuration assumes cache gateway servers hs22n18-21 and NFS servers js22n01-02.
# mmafmconfig add js22n01 --export-map js22n01/hs22n18,js22n02/hs22n19
mmafmconfig: Command successfully completed
mmafmconfig: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.
# mmafmconfig add js22n02 --export-map js22n02/hs22n20,js22n01/hs22n21
mmafmconfig: Command successfully completed
mmafmconfig: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.

The mmafmconfig command
The mmafmconfig command can be used to display, delete, or update mappings (see the sketch below). Changes take effect only after a fileset re-link or file system remount. The gateway designation can only be removed from a node if the node is not participating in an active mapping.
# mmafmconfig show
Map name: js22n01
Export server map: 192.168.200.12/hs22n19.gpfs.net,192.168.200.11/hs22n18.gpfs.net
Map name: js22n02
Export server map: 192.168.200.11/hs22n20.gpfs.net,192.168.200.12/hs22n21.gpfs.net
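
A hedged sketch of the update and delete actions mentioned above; the exact option syntax for update is assumed to mirror the add example, and the map/node names reuse the hypothetical hosts from the previous slide:
# Replace the export map for js22n01 with a single gateway
mmafmconfig update js22n01 --export-map js22n01/hs22n18
# Remove the mapping entirely (remember: changes take effect after fileset re-link or remount)
mmafmconfig delete js22n01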

Parallel reads and writes
Parallel reads and writes must be configured separately; they take effect for files larger than the configured threshold. The thresholds are defined by the following parameters: afmParallelWriteThreshold, afmParallelReadThreshold. (See the example below.)
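
A minimal sketch of setting the two thresholds named above cluster-wide via mmchconfig; the values and their unit are assumptions for illustration only:
# Only files larger than these thresholds are read/written in parallel
mmchconfig afmParallelReadThreshold=2048,afmParallelWriteThreshold=2048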

Parallel I/O process
- The metadata server of a cache fileset communicates with participating gateway nodes based on availability.
- Each I/O request is split into chunks, and the chunks are sent across all gateway nodes in parallel. The chunk size is configurable using afmParallelWriteChunkSize and afmParallelReadChunkSize.
- Gateway nodes communicate with their mapped NFS server and get a reply. If multiple gateways are mapped to the same NFS server at home, only one gateway node is chosen to perform a read task; writes are split among all the gateway nodes.
- The process is different for native GPFS protocol AFM setups.
- Additional configuration parameters for parallel I/O are available in the Advanced Administration Guide.

AFM parameters
- Set using mmchconfig, mmcrfileset, and mmchfileset: mmchconfig parameters are global defaults, and fileset-level settings override the defaults. AFM tuning options are dynamic.
- mmchfileset AFM options (-p afmAttribute=Value): afmAllowEviction, afmAsyncDelay, afmDirLookupRefreshInterval, afmDirOpenRefreshInterval, afmExpirationTimeout, afmFileLookupRefreshInterval, afmFileOpenRefreshInterval, afmMode, afmShowHomeSnapshot

Tunable parameters
- Revalidation timeout: afmFileLookupRefreshInterval = <0, MAXINT>, afmDirLookupRefreshInterval = <0, MAXINT>
- Async delay timeout (set per fileset to push updates to home at a later time): afmAsyncDelay = <0, MAXINT>
- Expiration timeout (after disconnection, data in cache expires after this time): afmExpirationTimeout = <0, MAXINT>
- Disconnect timeout (pingThread timeout for a gateway node to go into disconnected mode): afmDisconnectTimeout = <0, MAXINT>
- Parallel read threshold: afmParallelReadThreshold = <1 GB, maxfilesize>
- Operating mode (caching behavior; read-only is the default): afmMode = [read-only, single-writer, local-update]
(See the example below.)
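
A hedged example combining the global-default and per-fileset patterns from the previous slide for two of these tunables; the names and values are hypothetical and the intervals are assumed to be in seconds:
# Cluster-wide default: revalidate cached file attributes against home at most every 60 seconds
mmchconfig afmFileLookupRefreshInterval=60
# Per-fileset override: revalidate directory contents every 120 seconds for cache1 on fs1
mmchfileset fs1 cache1 -p afmDirLookupRefreshInterval=120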

AFM data migration from NFS (1 of 2)
- GPFS 4.1 adds the capability to migrate data from an existing legacy storage appliance to a GPFS cluster via the NFS protocol.
- Once the migration is complete, the legacy storage can be disconnected; an IP switchover is also possible, making the migration complete.
- Progressive data migration with little or no downtime is possible: the migration can be done via prefetch or dynamically based on demand, which minimizes downtime while moving data with its attributes to its new home.
- Consult the Advanced Administration Guide for step-by-step details.

AFM data migration from NFS (2 of 2)
Requirements:
- The target hardware must be running GPFS 4.1 or later.
- The data source should be an NFSv3 export and can be a GPFS or non-GPFS source; a GPFS source earlier than 3.4 is treated as a non-GPFS source.
Process for a GPFS source:
- All GPFS extended attributes, ACLs, and sparse files are maintained.
- Quotas, snapshots, file system tuning parameters, policies, fileset definitions, encryption keys, and DMAPI parameters remain unmoved.
- To keep hard links, prefetching must be done with the -ns option.
- AFM migrates data as root and bypasses permission checks; only actual data blocks are migrated, based on file size.
Process for a non-GPFS source:
- POSIX permissions and ACLs are pulled; no NFSv4/CIFS features migrate.
(See the migration sketch below.)
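
A hedged sketch of the migration flow described in these two slides: point an AFM cache fileset at the legacy NFS export, then prefetch in bulk. The names are hypothetical, and the choice of caching mode for a migration (local-update here) is an assumption rather than something this deck prescribes:
# Create a cache fileset whose home is the legacy NFS export
mmcrfileset newfs migrate1 -p afmTarget=legacynas:/export/data -p afmMode=lu --inode-space=new
mmlinkfileset newfs migrate1 -J /gpfs/newfs/migrate1
# Bulk prefetch using a list of files; the remainder can be pulled on demand
mmafmctl newfs prefetch -j migrate1 --list-file /tmp/all_files.list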

AFM optimization
GPFS 4.1 adds a number of optimizations to AFM:
- Prefetch enhancements allow prefetch to tolerate gateway node failures.
- AFM now comes with a new version of hashing (afmHashVersion=2) that minimizes the impact of gateway nodes joining or leaving the active cluster (see below).
- AFM cache states now distinguish different states based on fileset and queue state.
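
A minimal sketch of enabling the newer hashing mentioned above; whether this takes effect online or needs a daemon restart is not covered in this deck:
# Switch gateway-to-fileset hashing to version 2 cluster-wide
mmchconfig afmHashVersion=2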

Review
- Different caching modes.
- The cache is not transient: it holds real files in a real file system.

Exercise 7: Active File Management exercise

Unit summary
Having completed this unit, you should be able to:
- Describe home and cache features
- List the various AFM modes
- Create and manage an AFM relationship