IJ41328: AFM: FILESET STUCK IN FLUSHONLY, RECOVERY BLOCKED BY SYNC OP

APAR status

  • Closed as program error.

Error description

  • Error Description:
    
A fileset in independent-writer (IW) mode might be stuck in
    Recovery/FlushOnly status when there are synchronous reads from
    clients and those reads take a long time to complete (for example,
    the file is huge and has been migrated to tape at home), which
    blocks recovery's snapshot creation at the cache.
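
    For context, the overall state of the fileset can usually be confirmed
    from the cache cluster with mmafmctl getstate; the device and fileset
    names below match this example, and the exact output columns can vary
    by release. While the fileset is stuck, the cache state is expected to
    show FlushOnly or Recovery:

    # mmafmctl gpfs1 getstate -j myfileset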
    
From the gateway, we can see there are two inflight reads from client
    nodes.
    
# mmfsadm dump afm fset myfileset
    Fileset: myfileset 5 (gpfs1)
      mode: independent-writer queue: Blocked  MDS: <c0n0> QMem 1108 CTL 311
      home: mapping1 homeServer: 10.10.10.1 proto: nfs port: 2049   lastCmd: 16
      handler: Mounted Dirty  >> RecoveryInProg << refCount: 4
      queueTransfer: state: Idle senderVerified: 0 receiverVerified: 1 terminate: 0 psnapWait: 0
      remoteAttrs: AsyncLookups 0 tsfindinode: success 0 failed 0 totalTime 0.0 avgTime 0.000000 maxTime 0.0
      queue: delay 90 QLen 2+0 flushThds 0 maxFlushThds 10 numExec 261076 qfs 0 iwo 0 err 5
      handlerCreateTime : 2022-06-18_08:08:24.437+0800 numCreateSnaps : 0 InflightAsyncLookups 0
      lastReplayTime : 2022-06-18_08:08:24.437+0800 lastSyncTime : 2022-06-18_08:08:24.437+0800
      i/o: readBuf: 33554432 writeBuf: 2097152 sparseReadThresh: 134217728  pReadThreads 48
      i/o: pReadChunkSize 134217728 pReadThresh: 858993459200 pWriteThresh: 858993459200
      i/o: prefetchThresh 0 (Prefetch)
      Mnt status: 0:0 1:0 2:0 3:0
      Export Map:  10.10.10.1/<c0n0>  10.10.10.2/<c0n1>
      iw: afmIwTakeoverTime 0
      Priority Queue:  Empty (state: Blocked)
      Normal Queue: (listed by execution order)(state: Blocked)
           0 Read [203362323.203362323] inflight (3424 @ 32320233472) vIdx 3 tid 84708
           1 Read [186342413.186342413] inflight (6972 @ 8009179136) vIdx 1 tid 136195
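
    If several AFM filesets are configured, one illustrative way to spot
    which filesets have a recovery in progress or reads still inflight is
    to filter the same dump on the gateway node:

    # mmfsadm dump afm | grep -E "Fileset:|RecoveryInProg|inflight"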
    
    
From the gateway's GPFS log (/var/adm/ras/mmfs.log.latest), we can see
    that the fileset recovery failed because the snapshot could not be
    created after the quiesce timeout, and reads from home failed with
    error 5 or error 78.
    
2022-06-18_08:34:59.590+0800: [I] Calling user exit script mmAfmRecoveryStart: event afmRecoveryStart, Async command /usr/lpp/mmfs/bin/mmsysmonc, filesystem gpfs1, fileset myfileset.
    2022-06-18_08:35:00.807+0800: [N] AFM: Starting recovery for fileset 'myfileset' in fs 'gpfs1'
    2022-06-18_08:35:00.819+0800: mmcommon afmrecovery invoked: device=gpfs1 filesetId=5 filesetName=myfileset resync=0 recoverySeq=23994175 snap=1 mode=6 fsType=0 drpsnap=0
    2022-06-18_08:35:01.596+0800: [I] AFM: Cleaning up stale recovery snapshots
    2022-06-18_08:35:11.782+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10076 msec.
    2022-06-18_08:36:21.854+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10063 msec.
    2022-06-18_08:37:32.002+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10098 msec.
    2022-06-18_08:38:42.113+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10093 msec.
    2022-06-18_08:39:52.232+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10099 msec.
    2022-06-18_08:40:11.468+0800: [W] AFM: Home is taking longer than expected to respond for cache myfileset, mount index 3.
    2022-06-18_08:40:21.475+0800: [W] AFM: Home is taking longer than expected to respond for cache myfileset, mount index 3.
    2022-06-18_08:40:21.487+0800: [I] AFM: Previous error repeated 2 times.
    2022-06-18_08:40:21.487+0800: [E] AFM: Read file system gpfs1 fileset myfileset file IDs [182398983.182398983.-1.-1,N] name / remote error 5
    2022-06-18_08:40:21.487+0800: Input/output error
    2022-06-18_08:41:02.350+0800: [D] Snapshot whole quiesce of SG gpfs1 from fsmgr.gpfs.cluster on this node lasted 10094 msec.
    2022-06-18_08:45:31.729+0800: [I] AFM: Previous error repeated 1 times.
    ....................
    2022-06-18_08:45:51.745+0800: [E] AFM: Read file system gpfs1 fileset myfileset file IDs [186342413.186342413.-1.-1,T] name / remote error 78
    2022-06-18_08:45:51.745+0800: Connection timed out
    2022-06-18_08:45:51.745+0800: [W] AFM: Home is taking longer than expected to respond for cache myfileset, mount index 1.
    2022-06-18_08:46:01.748+0800: [W] AFM: Home is taking longer than expected to respond for cache myfileset, mount index 1.
    2022-06-18_08:46:01.758+0800: [I] AFM: Previous error repeated 1 times.
    2022-06-18_08:46:01.759+0800: [E] AFM: Read file system gpfs1 fileset myfileset file IDs [186342413.186342413.-1.-1,N] name / remote error 5
    2022-06-18_08:46:01.759+0800: Input/output error
    AFM: Failed to create snapshot for fileset myfileset to run recovery.
    Unable to quiesce all nodes; some processes are busy or holding required resources.
    2022-06-18_08:46:02.465+0800: [I] AFM: Stopping recovery for the file system gpfs1 fileset myfileset
    2022-06-18_08:46:04.469+0800: [I] Calling user exit script mmAfmRecoveryFail: event afmRecoveryFail, Async command /usr/lpp/mmfs/bin/mmsysmonc, filesystem gpfs1, fileset myfileset.
    2022-06-18_08:46:04.475+0800: [E] AFM: Recovery on file system gpfs1 fileset myfileset failed with error 78. Recovery will be retried on next access after recovery retry interval (120 seconds) or manually resolve known problems and recover the fileset.
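
    The recovery-related messages can be pulled out of the gateway log with
    a simple filter; the pattern below is only illustrative and matches the
    messages shown above:

    # grep -E "afmRecoveryStart|Failed to create snapshot|Recovery on file system|afmRecoveryFail" /var/adm/ras/mmfs.log.latest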
    
    Reported in:
    Spectrum Scale 5.1.1.4 on RHEL7
    
    Known Impact:
    AFM fileset not active
    
    Verification steps:
    
    Recovery action:
Stop the jobs triggering the slow reads from home; this lets recovery
    complete first.
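
    Once those reads have been stopped, recovery is retried on the next
    access to the fileset (after the 120-second retry interval noted in the
    log above); the state can then be re-checked from the gateway, for
    example:

    # mmafmctl gpfs1 getstate -j myfileset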
    

Local fix

  • N/A
    

Problem summary

A synchronous on-demand read that triggers recovery on an
    independent-writer (IW) mode fileset can block recovery if the home
    file is in a migrated state. An illustrative trigger is shown below.
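
    For illustration only (the mount path and file name here are
    hypothetical), any blocking read of an uncached file whose home copy
    is HSM-migrated is enough to hit this, for example:

    # cat /gpfs1/myfileset/large_migrated_file > /dev/null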
    

Problem conclusion

This problem is fixed in 5.1.2 PTF6.
    To see all Spectrum Scale APARs and their respective fix solutions,
    refer to
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
    
Benefits of the solution:
    Recovery is not blocked.

    Work around:
    Trigger recovery separately on the IW fileset through ls or touch
    operations, and only then trigger such synchronous reads on uncached
    files which might be migrated to HSM at the home site (see the
    illustrative steps at the end of this section).

    Problem trigger:
    A synchronous read performed on an IW fileset that needs recovery to
    be run, where the file is migrated to HSM at the home site.

    Symptom:
    Performance Impact

    Platforms affected:
    All Linux OS Environments (serving as AFM Gateway nodes)

    Functional Area affected:
    AFM

    Customer Impact:
    High importance
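
    A minimal sketch of the workaround above, assuming a hypothetical cache
    mount path of /gpfs1/myfileset and a hypothetical file name:

    1. Trigger recovery with a metadata-only operation so that no blocking
       data read is queued:
       # ls /gpfs1/myfileset > /dev/null
    2. Wait for recovery to complete (cache state back to Active or Dirty):
       # mmafmctl gpfs1 getstate -j myfileset
    3. Only then start the synchronous reads of the uncached files that may
       be HSM-migrated at the home site:
       # cat /gpfs1/myfileset/large_migrated_file > /dev/null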
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ41328

  • Reported component name

    SPEC SCALE ADV

  • Reported component ID

    5737F35AP

  • Reported release

    511

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2022-07-22

  • Closed date

    2022-08-01

  • Last modified date

    2022-08-01

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE ADV

  • Fixed component ID

    5737F35AP

Applicable component levels

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"STXKQY"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"511","Line of Business":{"code":"LOB26","label":"Storage"}}]
