IT36694: SPACE MANAGEMENT DISTRIBUTED RECALL MIGHT HANG DUE TO LONG WAITERS.


APAR status

  • Closed as program error.

Error description

  • The IBM Spectrum Protect Client for Space Management (HSM)
    distributed recall might hang due to long waiters on the nodes.

    When the distributed recall daemon delegates a recall request to
    another HSM node and a problem occurs on that node, such as a
    restart of dsmrecalld (the HSM recall daemon), then, depending on
    internal run-time conditions, the other node can lose track of the
    recall request.

    However, the distributed recall daemon does not know this and
    assumes the recall request is still pending. As a result,
    subsequent requests to access the same i-node wait forever,
    because they cannot be satisfied until the original request is
    satisfied. The original request is never satisfied because it was
    discarded on the remote node. (A simplified sketch of this race is
    included at the end of this description.)
    
    Customer/Support Diagnostics (If Applicable):

    The stack trace for the hanging process (pid: 163304) might be
    similar to the following output:
    
    cat /proc/163304/stack
    [<ffffffffc0898ee1>] cxiWaitEventWait+0x1d1/0x2f0 [mmfslinux]
    [<ffffffffc0a86610>] _ZN6ThCond12internalWaitEP16KernelSynchStatejPv+0xd0/0x260 [mmfs26]
    [<ffffffffc0a87cdb>] _ZN6ThCond5kWaitEiPKc+0x1db/0x410 [mmfs26]
    [<ffffffffc09ada75>] _ZN13KernelMailbox21sendToDaemonWithReplyEv+0x375/0x410 [mmfs26]
    [<ffffffffc0ada7c3>] _Z19dmGenerateDataEvent14dm_eventtype_tP15KernelOperationiP13gpfsVfsData_tP7FileUIDP10gpfsNode_txyP5ErrnoPiPxPjSC_+0x15b3/0x1af0 [mmfs26]
    [<ffffffffc0a1b600>] _Z8gpfsReadP13gpfsVfsData_tP15KernelOperationP9cxiNode_tiP8cxiUio_tP9MMFSVInfoP10cxiVattr_tSA_P10ext_cred_tP14cxiPageLists_ti+0x17b0/0x3530 [mmfs26]
    [<ffffffffc08a4dd4>] rdwrInternal+0x454/0x6b0 [mmfslinux]
    [<ffffffffc08a50a7>] gpfs_f_read+0x77/0xc0 [mmfslinux]
    [<ffffffff81200bac>] vfs_read+0x9c/0x170
    [<ffffffff81201a6f>] SyS_read+0x7f/0xe0
    [<ffffffff816b5212>] tracesys+0xdd/0xe2
    [<ffffffffffffffff>] 0xffffffffffffffff
    
    The Scale "mmhealth node eventlog --day" command would report the
    following:
    ...
    2021-02-07 23:34:18.438490 AEDT longwaiters_found ERROR Detected Spectrum Scale long-waiters
    2021-02-07 23:34:18.438490 AEDT deadlock_detected WARNING The cluster detected a Spectrum Scale filesystem deadlock
    ...
    
    The Scale "mmfsadm dump dmapi" command would
    report a few data events accumulating for the same i-node (i.e.
    251092306) over time:
    ...
     Session ID 4c85d45d00000000,
     DMAPI event queue:
     Number of events on queue (total/inactive/outstanding): 25/0/25
     (sync/async): 25/0
     (repost): 0
     Event queue accumulative statistics:
     Total number of events added (total/sync/async): 666283/7187/659096
     reposted: 0
     moved from another queue: 0
     added as user event: 660607
     received by the DM application: 664772
     responded by the DM application: 7162
     moved to another queue: 0
     Last sync event added to queue is on Sat Dec 14 18:42:30 2020
     Last sync user event added to queue is on Wed Dec 4 09:58:15 2020
     Last async user event added to queue is on Sun Dec 15 17:36:17 2020

     Event Q element 1:
     Status: EV_VALID
     node <c0n6>
     Event type: READ
     Event token: 0x00000006-00000000-65679561-00000000
     Event sequence: 0x00000006-00000000-65679561-00000000
     Event node (SDR node_number): 9
     de_handle: sgId 0A01061E:5C76089A inode 251092306 gen 0xFC197D2 snap 0
     de_offset: 0x0, de_length: 32768
     Num of Access rights 0
     Version number 2
     Number of nodes with cached access rights: 0
     Added to queue on: Tue Dec 10 12:56:33 2020
     Event delivered on: Tue Dec 10 12:56:33 2020

     Event Q element 2:
     Status: EV_VALID
     node <c0n6>
     Event type: READ
     Event token: 0x00000006-00000000-65681C6D-00000000
     Event sequence: 0x00000006-00000000-65681C6D-00000000
     Event node (SDR node_number): 9
     de_handle: sgId 0A01061E:5C76089A inode 251092306 gen 0xFC197D2 snap 0
     de_offset: 0x0, de_length: 32768
     Num of Access rights 0
     Version number 2
    .................

     Event Q element 25:
     Status: EV_VALID
     node <c0n8>
     Event type: READ
     Event token: 0x00000008-00000000-D0A97EA8-00000000
     Event sequence: 0x00000008-00000000-D0A97EA8-00000000
     Event node (SDR node_number): 6
     de_handle: sgId 0A01061E:5C76089A inode 251092306 gen 0xFC197D2 snap 0
     de_offset: 0x30000, de_length: 65536
     Num of Access rights 0
     Version number 2
     Number of nodes with cached access rights: 0
     Added to queue on: Sat Dec 14 18:41:08 2020
     Event delivered on: Sat Dec 14 18:41:08 2020
    ...
    
    IBM Spectrum Protect Versions Affected: Space
    Management Client for Unix and Linux 7.1.x and 8.1.x on all
    supported platforms
    
    Initial Impact: Low
    
    Additional KEYWORDS:
    Spectrum Protect; HSM; TS004921636; recall; hang; dsmrecalld
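
    The following Python sketch is a simplified, hypothetical model of
    the race described above. The class and method names are invented
    for illustration and do not correspond to the dsmrecalld source
    code; the point is only to show why a stale "pending" entry on the
    delegating node makes every later access to the same i-node wait
    forever once the remote node has discarded the request:

    # Hypothetical model only; not the HSM implementation.
    class RemoteNode:
        """Remote HSM node that executes delegated recalls."""
        def __init__(self):
            self.queue = []            # delegated requests it knows about

        def accept(self, inode):
            self.queue.append(inode)

        def restart(self):
            # A dsmrecalld restart drops the in-memory queue, so
            # already-delegated requests are silently discarded.
            self.queue = []

    class DelegatingNode:
        """Node that delegates recalls and tracks them as pending."""
        def __init__(self, remote):
            self.remote = remote
            self.pending = set()       # i-nodes believed to be in flight

        def recall(self, inode):
            if inode in self.pending:
                # Later access to the same i-node waits for the
                # original request, which no longer exists remotely.
                return "WAIT (hangs: original request was discarded)"
            self.pending.add(inode)
            self.remote.accept(inode)
            return "DELEGATED"

    remote = RemoteNode()
    local = DelegatingNode(remote)
    print(local.recall(251092306))     # DELEGATED
    remote.restart()                   # problem on the remote node
    print(local.recall(251092306))     # WAIT ... (the hang in this APAR)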
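
    When reviewing such a dump offline, it can help to count how many
    queued data events reference the same i-node. The following rough
    Python sketch assumes the dump was captured to a file, for example
    with "mmfsadm dump dmapi > dmapi.out", and that the line layout
    matches the sample above; the exact format may differ between
    Spectrum Scale versions, so treat it as a starting point rather
    than a supported tool:

    # Count queued DMAPI events per i-node from a saved dump file.
    # The regular expressions are based only on the sample output in
    # this APAR and may need adjusting for other Scale versions.
    import re
    import sys
    from collections import Counter

    counts = Counter()
    current_type = None

    with open(sys.argv[1]) as dump:
        for line in dump:
            m = re.search(r"Event type:\s+(\S+)", line)
            if m:
                current_type = m.group(1)
                continue
            m = re.search(r"de_handle:.*\binode\s+(\d+)", line)
            if m and current_type:
                counts[(m.group(1), current_type)] += 1
                current_type = None

    # Many events of the same type piling up on one i-node (for
    # example READ on 251092306 above) matches the hang pattern
    # described in this APAR.
    for (inode, ev_type), n in counts.most_common(10):
        print(f"inode {inode}: {n} queued {ev_type} event(s)")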
    

Local fix

  • Restart the dsmrecalld daemon on all nodes.
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * IBM Spectrum Protect for Space Management (HSM) client       *
    * versions 7.1.x and 8.1.x on all platforms                    *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See ERROR DESCRIPTION.                                       *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Apply fixing level when available. This problem is projected *
    * to be fixed in HSM client levels 7.1.8.11 and 8.1.12. Note   *
    * that this is subject to change at the discretion of IBM.     *
    ****************************************************************
    

Problem conclusion

  • After the fix, the HSM recall function will work reliably
    regardless of the current recall workload.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT36694

  • Reported component name

    TSM SPACE MGMT

  • Reported component ID

    5698HSMCL

  • Reported release

    81L

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2021-04-26

  • Closed date

    2021-06-07

  • Last modified date

    2021-06-07

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Modules/Macros

  • dsmrecal
    

Fix information

  • Fixed component name

    TSM SPACE MGMT

  • Fixed component ID

    5698HSMCL

Applicable component levels

[{"Line of Business":{"code":"LOB26","label":"Storage"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSSR2R","label":"Tivoli Storage Manager for Space Management"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"81L"}]

Document Information

Modified date:
08 June 2021