APAR status
Closed as program error.
Error description
Error Description: An IW (independent-writer) fileset might be stuck in Recovery/FlushOnly status when there are synchronous reads from the client and the reads take a long time to complete (e.g. the file is huge and migrated to tape at home), and hence block the recovery's snapshot create at cache. From the gateway, we can see there are two inflight reads from client nodes:

# mmfsadm dump afm fset myfileset
Fileset: myfileset 5 (gpfs1)
  mode: independent-writer queue: Blocked MDS: <c0n0> QMem 1108 CTL 311
  home: mapping1 homeServer: 10.10.10.1 proto: nfs port: 2049 lastCmd: 16
  handler: Mounted Dirty >> RecoveryInProg << refCount: 4
  queueTransfer: state: Idle senderVerified: 0 receiverVerified: 1 terminate: 0 psnapWait: 0
  remoteAttrs: AsyncLookups 0
  tsfindinode: success 0 failed 0 totalTime 0.0 avgTime 0.000000 maxTime 0.0
  queue: delay 90 QLen 2+0 flushThds 0 maxFlushThds 10 numExec 261076 qfs 0 iwo 0 err 5
  handlerCreateTime : 2022-06-18_08:08:24.437+0800
  numCreateSnaps : 0 InflightAsyncLookups 0
  lastReplayTime : 2022-06-18_08:08:24.437+0800
  lastSyncTime : 2022-06-18_08:08:24.437+0800
  i/o: readBuf: 33554432 writeBuf: 2097152 sparseReadThresh: 134217728 pReadThreads 48
  i/o: pReadChunkSize 134217728 pReadThresh: 858993459200 pWriteThresh: 858993459200
  i/o: prefetchThresh 0 (Prefetch)
  Mnt status: 0:0 1:0 2:0 3:0
  Export Map: 10.10.10.1/<c0n0> 10.10.10.2/<c0n1>
  iw: afmIwTakeoverTime 0
Priority Queue: Empty (state: Blocked)
Normal Queue: (listed by execution order) (state: Blocked)
  0 Read [203362323.203362323] inflight (3424 @ 32320233472) vIdx 3 tid 84708
  1 Read [186342413.186342413] inflight (6972 @ 8009179136) vIdx 1 tid 136195

From the gateway's GPFS log /var/adm/ras/mmfs.log.latest, we can see that the fileset recovery failed because the snapshot could not be created after the quiesce timeout, and reads from home failed with error 5 or error 78.

2022-06-18_08:34:59.590+0800: [I] Calling user exit script mmAfmRecoveryStart: event afmRecoveryStart, Async command /usr/lpp/mmfs/bin/mmsysmonc, filesystem gpfs1, fileset myfileset.
2022-06-18_08:35:00.807+0800: [N] AFM: Starting recovery for fileset 'myfileset' in fs 'gpfs1'
2022-06-18_08:35:00.819+0800: mmcommon afmrecovery invoked : device=gpfs1 filesetId=5 filesetName=myfileset resync=0 recoverySeq=23994175 snap=1 mode=6 fsType=0 drpsnap=0
2022-06-18_08:35:01.596+0800: [I] AFM: Cleaning up stale recovery snapshots
2022-06-18_08:35:11.782+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10076 msec.
2022-06-18_08:36:21.854+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10063 msec.
2022-06-18_08:37:32.002+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10098 msec.
2022-06-18_08:38:42.113+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10093 msec.
2022-06-18_08:39:52.232+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10099 msec.
2022-06-18_08:40:11.468+0800: [W] AFM: Home is taking longer than expected to respond for cache myfileset, mount index 3.
2022-06-18_08:40:21.475+0800: [W] AFM: Home is taking longer than expected to respond for cache myfileset, mount index 3.
2022-06-18_08:40:21.487+0800: [I] AFM: Previous error repeated 2 times.
2022-06-18_08:40:21.487+0800: [E] AFM: Read file system gpfs1 fileset myfileset file IDs [182398983.182398983.-1.-1,N] name / remote error 5
2022-06-18_08:40:21.487+0800: Input/output error
2022-06-18_08:41:02.350+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10094 msec.
2022-06-18_08:45:31.729+0800: [I] AFM: Previous error repeated 1 times.
....................
2022-06-18_08:45:51.745+0800: [E] AFM: Read file system gpfs1 fileset myfileset file IDs [186342413.186342413.-1.-1,T] name / remote error 78
2022-06-18_08:45:51.745+0800: Connection timed out
2022-06-18_08:45:51.745+0800: [W] AFM: Home is taking longer than expected to respond for cache myfileset, mount index 1.
2022-06-18_08:46:01.748+0800: [W] AFM: Home is taking longer than expected to respond for cache myfileset, mount index 1.
2022-06-18_08:46:01.758+0800: [I] AFM: Previous error repeated 1 times.
2022-06-18_08:46:01.759+0800: [E] AFM: Read file system gpfs1 fileset myfileset file IDs [186342413.186342413.-1.-1,N] name / remote error 5
2022-06-18_08:46:01.759+0800: Input/output error
AFM: Failed to create snapshot for fileset myfileset to run recovery. Unable to quiesce all nodes; some processes are busy or holding required resources.
2022-06-18_08:46:02.465+0800: [I] AFM: Stopping recovery for the file system gpfs1 fileset myfileset
2022-06-18_08:46:04.469+0800: [I] Calling user exit script mmAfmRecoveryFail: event afmRecoveryFail, Async command /usr/lpp/mmfs/bin/mmsysmonc, filesystem gpfs1, fileset myfileset.
2022-06-18_08:46:04.475+0800: [E] AFM: Recovery on file system gpfs1 fileset myfileset failed with error 78. Recovery will be retried on next access after recovery retry interval (120 seconds) or manually resolve known problems and recover the fileset.

Reported in: Spectrum Scale 5.1.1.4 on RHEL7
Known Impact: AFM fileset not active
Verification steps:
Recovery action: Stop the jobs triggering the slow reads from home; this lets recovery complete first.
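For triage, the failure leaves a recognizable signature in the gateway log. The following is a minimal sketch (not part of the APAR) that checks for the three markers together: repeated snapshot quiesce attempts, remote read errors 5/78, and the recovery failure. The heredoc sample and the path /tmp/mmfs.log.sample are illustrative assumptions standing in for /var/adm/ras/mmfs.log.latest.

```shell
# Minimal triage sketch for the log signature described above. The sample
# log below is a stand-in for /var/adm/ras/mmfs.log.latest; the path
# /tmp/mmfs.log.sample is an illustrative assumption.
cat > /tmp/mmfs.log.sample <<'EOF'
2022-06-18_08:35:11.782+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10076 msec.
2022-06-18_08:40:21.487+0800: [E] AFM: Read file system gpfs1 fileset myfileset file IDs [182398983.182398983.-1.-1,N] name / remote error 5
2022-06-18_08:46:04.475+0800: [E] AFM: Recovery on file system gpfs1 fileset myfileset failed with error 78.
EOF

# Count each marker separately.
quiesce=$(grep -c 'Snapshot whole quiesce' /tmp/mmfs.log.sample)
readerr=$(grep -cE 'AFM: Read .* remote error (5|78)' /tmp/mmfs.log.sample)
recfail=$(grep -c 'AFM: Recovery on file system .* failed' /tmp/mmfs.log.sample)

# All three markers together suggest recovery blocked by slow reads from home.
if [ "$quiesce" -gt 0 ] && [ "$readerr" -gt 0 ] && [ "$recfail" -gt 0 ]; then
    echo "signature matched: recovery likely blocked by slow home reads"
fi
```

On a real gateway, point the greps at /var/adm/ras/mmfs.log.latest instead of the sample file.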
Local fix
N/A
Problem summary
A synchronous on-demand read that triggers recovery on an independent-writer (IW) mode fileset can block the recovery if the file at home is in migrated state.
Problem conclusion
This problem is fixed in 5.1.2 PTF6. To see all Spectrum Scale APARs and their respective fix solutions, refer to https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
Benefits of the solution: Recovery is not blocked.
Work around: Trigger recovery separately on the IW fileset through ls or touch operations, and only then trigger such synchronous reads on uncached files that might be migrated to HSM at the home site.
Problem trigger: A synchronous read performed on an IW fileset that needs recovery to be run, where the file is migrated to HSM at the home site.
Symptom: Performance Impact
Platforms affected: All Linux OS Environments (serving as AFM Gateway nodes)
Functional Area affected: AFM
Customer Impact: High importance
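The workaround ordering can be sketched as a dry run. The mount point /gpfs1/myfileset, the file name bigfile, and the run() helper are illustrative assumptions; run() only prints each step so the ordering is visible without a live GPFS cluster (drop the echo to execute for real):

```shell
# Dry-run sketch of the workaround ordering. The mount point
# /gpfs1/myfileset and the file bigfile are illustrative assumptions;
# run() prints each command instead of executing it.
rm -f /tmp/workaround.steps
run() { echo "step: $*" | tee -a /tmp/workaround.steps; }

# 1. Trigger recovery first with a cheap metadata operation (ls/touch),
#    which does not queue a long synchronous read to home.
run ls /gpfs1/myfileset

# 2. Confirm the fileset has left Recovery/FlushOnly state, e.g. by
#    polling the AFM fileset state with mmafmctl.
run mmafmctl gpfs1 getstate -j myfileset

# 3. Only then start the synchronous reads of uncached files that may be
#    migrated to HSM at the home site.
run cat /gpfs1/myfileset/bigfile
```

Step 2 matters because issuing the slow read while recovery is still queued re-creates the blocked state this APAR describes.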
Temporary fix
Comments
APAR Information
APAR number
IJ41328
Reported component name
SPEC SCALE ADV
Reported component ID
5737F35AP
Reported release
511
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2022-07-22
Closed date
2022-08-01
Last modified date
2022-08-01
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE ADV
Fixed component ID
5737F35AP
Applicable component levels
[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"511","Line of Business":{"code":"LOB26","label":"Storage"}}]
Document Information
Modified date:
01 August 2022