APAR status
Closed as program error.
Error description
Error Description: An IW (independent-writer) fileset might be stuck in Recovery/FlushOnly status when there are synchronous reads from the client and the reads take a long time to complete (e.g. the file is huge and migrated to tape at home), and hence block the recovery's snapshot create at cache. From the gateway, we can see there are two inflight reads from client nodes:

# mmfsadm dump afm fset myfileset
Fileset: myfileset 5 (gpfs1)
  mode: independent-writer queue: Blocked MDS: <c0n0> QMem 1108 CTL 311
  home: mapping1 homeServer: 10.10.10.1 proto: nfs port: 2049 lastCmd: 16
  handler: Mounted Dirty >> RecoveryInProg << refCount: 4
  queueTransfer: state: Idle senderVerified: 0 receiverVerified: 1 terminate: 0 psnapWait: 0
  remoteAttrs: AsyncLookups 0
  tsfindinode: success 0 failed 0 totalTime 0.0 avgTime 0.000000 maxTime 0.0
  queue: delay 90 QLen 2+0 flushThds 0 maxFlushThds 10 numExec 261076 qfs 0 iwo 0 err 5
  handlerCreateTime : 2022-06-18_08:08:24.437+0800
  numCreateSnaps : 0 InflightAsyncLookups 0
  lastReplayTime : 2022-06-18_08:08:24.437+0800
  lastSyncTime : 2022-06-18_08:08:24.437+0800
  i/o: readBuf: 33554432 writeBuf: 2097152 sparseReadThresh: 134217728 pReadThreads 48
  i/o: pReadChunkSize 134217728 pReadThresh: 858993459200 pWriteThresh: 858993459200
  i/o: prefetchThresh 0 (Prefetch)
  Mnt status: 0:0 1:0 2:0 3:0
  Export Map: 10.10.10.1/<c0n0> 10.10.10.2/<c0n1>
  iw: afmIwTakeoverTime 0
Priority Queue: Empty (state: Blocked)
Normal Queue: (listed by execution order) (state: Blocked)
  0 Read [203362323.203362323] inflight (3424 @ 32320233472) vIdx 3 tid 84708
  1 Read [186342413.186342413] inflight (6972 @ 8009179136) vIdx 1 tid 136195

From the gateway's GPFS log /var/adm/ras/mmfs.log.latest, we can see that the fileset recovery failed because the snapshot could not be created after the quiesce timeout, and reads from home failed with error 5 or error 78.

2022-06-18_08:34:59.590+0800: [I] Calling user exit script mmAfmRecoveryStart: event afmRecoveryStart, Async command /usr/lpp/mmfs/bin/mmsysmonc, filesystem gpfs1, fileset myfileset.
2022-06-18_08:35:00.807+0800: [N] AFM: Starting recovery for fileset 'myfileset' in fs 'gpfs1'
2022-06-18_08:35:00.819+0800: mmcommon afmrecovery invoked : device=gpfs1 filesetId=5 filesetName=myfileset resync=0 recoverySeq=23994175 snap=1 mode=6 fsType=0 drpsnap=0
2022-06-18_08:35:01.596+0800: [I] AFM: Cleaning up stale recovery snapshots
2022-06-18_08:35:11.782+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10076 msec.
2022-06-18_08:36:21.854+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10063 msec.
2022-06-18_08:37:32.002+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10098 msec.
2022-06-18_08:38:42.113+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10093 msec.
2022-06-18_08:39:52.232+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10099 msec.
2022-06-18_08:40:11.468+0800: [W] AFM: Home is taking longer than expected to respond for cache myfileset, mount index 3.
2022-06-18_08:40:21.475+0800: [W] AFM: Home is taking longer than expected to respond for cache myfileset, mount index 3.
2022-06-18_08:40:21.487+0800: [I] AFM: Previous error repeated 2 times.
2022-06-18_08:40:21.487+0800: [E] AFM: Read file system gpfs1 fileset myfileset file IDs [182398983.182398983.-1.-1,N] name / remote error 5
2022-06-18_08:40:21.487+0800: Input/output error
2022-06-18_08:41:02.350+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10094 msec.
2022-06-18_08:45:31.729+0800: [I] AFM: Previous error repeated 1 times.
....................
2022-06-18_08:45:51.745+0800: [E] AFM: Read file system gpfs1 fileset myfileset file IDs [186342413.186342413.-1.-1,T] name / remote error 78
2022-06-18_08:45:51.745+0800: Connection timed out
2022-06-18_08:45:51.745+0800: [W] AFM: Home is taking longer than expected to respond for cache myfileset, mount index 1.
2022-06-18_08:46:01.748+0800: [W] AFM: Home is taking longer than expected to respond for cache myfileset, mount index 1.
2022-06-18_08:46:01.758+0800: [I] AFM: Previous error repeated 1 times.
2022-06-18_08:46:01.759+0800: [E] AFM: Read file system gpfs1 fileset myfileset file IDs [186342413.186342413.-1.-1,N] name / remote error 5
2022-06-18_08:46:01.759+0800: Input/output error
AFM: Failed to create snapshot for fileset myfileset to run recovery. Unable to quiesce all nodes; some processes are busy or holding required resources.
2022-06-18_08:46:02.465+0800: [I] AFM: Stopping recovery for the file system gpfs1 fileset myfileset
2022-06-18_08:46:04.469+0800: [I] Calling user exit script mmAfmRecoveryFail: event afmRecoveryFail, Async command /usr/lpp/mmfs/bin/mmsysmonc, filesystem gpfs1, fileset myfileset.
2022-06-18_08:46:04.475+0800: [E] AFM: Recovery on file system gpfs1 fileset myfileset failed with error 78. Recovery will be retried on next access after recovery retry interval (120 seconds) or manually resolve known problems and recover the fileset.

Reported in: Spectrum Scale 5.1.1.4 on RHEL7
Known Impact: AFM fileset not active
Verification steps:
Recovery action: Stop the jobs triggering the slow reads from home; this lets recovery complete first.
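For triage, the failure leaves a recognizable signature in the gateway log. The following is a minimal sketch (not part of the APAR) that checks for the three markers together: repeated snapshot quiesce attempts, remote read errors 5/78, and the recovery failure. The heredoc sample and the path /tmp/mmfs.log.sample are illustrative assumptions standing in for /var/adm/ras/mmfs.log.latest.

```shell
# Minimal triage sketch for the log signature described above. The sample
# log below is a stand-in for /var/adm/ras/mmfs.log.latest; the path
# /tmp/mmfs.log.sample is an illustrative assumption.
cat > /tmp/mmfs.log.sample <<'EOF'
2022-06-18_08:35:11.782+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10076 msec.
2022-06-18_08:40:21.487+0800: [E] AFM: Read file system gpfs1 fileset myfileset file IDs [182398983.182398983.-1.-1,N] name / remote error 5
2022-06-18_08:46:04.475+0800: [E] AFM: Recovery on file system gpfs1 fileset myfileset failed with error 78.
EOF

# Count each marker separately.
quiesce=$(grep -c 'Snapshot whole quiesce' /tmp/mmfs.log.sample)
readerr=$(grep -cE 'AFM: Read .* remote error (5|78)' /tmp/mmfs.log.sample)
recfail=$(grep -c 'AFM: Recovery on file system .* failed' /tmp/mmfs.log.sample)

# All three markers together suggest recovery blocked by slow reads from home.
if [ "$quiesce" -gt 0 ] && [ "$readerr" -gt 0 ] && [ "$recfail" -gt 0 ]; then
    echo "signature matched: recovery likely blocked by slow home reads"
fi
```

On a real gateway, point the greps at /var/adm/ras/mmfs.log.latest instead of the sample file.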
Local fix
N/A
Problem summary
A synchronous on-demand read that triggers recovery on an independent-writer (IW) mode fileset can block the recovery if the file at home is in migrated state.
Problem conclusion
This problem is fixed in 5.1.2 PTF6. To see all Spectrum Scale APARs and their respective fix solutions, refer to https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
Benefits of the solution: Recovery is not blocked.
Work around: Trigger recovery separately on the IW fileset through ls or touch operations, and only then trigger such synchronous reads on uncached files that might be migrated to HSM at the home site.
Problem trigger: A synchronous read performed on an IW fileset that needs recovery to be run, where the file is migrated to HSM at the home site.
Symptom: Performance Impact
Platforms affected: All Linux OS Environments (serving as AFM Gateway nodes)
Functional Area affected: AFM
Customer Impact: High importance
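The workaround ordering can be sketched as a dry run. The mount point /gpfs1/myfileset, the file name bigfile, and the run() helper are illustrative assumptions; run() only prints each step so the ordering is visible without a live GPFS cluster (drop the echo to execute for real):

```shell
# Dry-run sketch of the workaround ordering. The mount point
# /gpfs1/myfileset and the file bigfile are illustrative assumptions;
# run() prints each command instead of executing it.
rm -f /tmp/workaround.steps
run() { echo "step: $*" | tee -a /tmp/workaround.steps; }

# 1. Trigger recovery first with a cheap metadata operation (ls/touch),
#    which does not queue a long synchronous read to home.
run ls /gpfs1/myfileset

# 2. Confirm the fileset has left Recovery/FlushOnly state, e.g. by
#    polling the AFM fileset state with mmafmctl.
run mmafmctl gpfs1 getstate -j myfileset

# 3. Only then start the synchronous reads of uncached files that may be
#    migrated to HSM at the home site.
run cat /gpfs1/myfileset/bigfile
```

Step 2 matters because issuing the slow read while recovery is still queued re-creates the blocked state this APAR describes.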
Temporary fix
Comments
APAR Information
APAR number
IJ41328
Reported component name
SPEC SCALE ADV
Reported component ID
5737F35AP
Reported release
511
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2022-07-22
Closed date
2022-08-01
Last modified date
2022-08-01
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE ADV
Fixed component ID
5737F35AP
Applicable component levels
[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"511","Line of Business":{"code":"LOB26","label":"Storage"}}]
Document Information
Modified date:
01 August 2022