APAR status
Closed as program error.
Error description
The IBM Spectrum Protect Client for Space Management (HSM) distributed recall might hang due to long waiters on the nodes. When the distributed recall daemon delegates a recall request to another HSM node and a problem occurs on that node, such as a restart of dsmrecalld (the HSM recall daemon), then depending on internal run-time conditions the remote node can lose track of the recall request. The delegating distributed recall daemon does not know this and still considers the request pending. As a result, subsequent requests to access the same i-node wait forever: they cannot be satisfied until the original request completes, and the original request never completes because it was discarded on the remote node.

Customer/Support Diagnostics (If Applicable):

The stack trace for the hanging process (pid: 163304) might be similar to the following output:

cat /proc/163304/stack
[<ffffffffc0898ee1>] cxiWaitEventWait+0x1d1/0x2f0 [mmfslinux]
[<ffffffffc0a86610>] _ZN6ThCond12internalWaitEP16KernelSynchStatejPv+0xd0/0x260 [mmfs26]
[<ffffffffc0a87cdb>] _ZN6ThCond5kWaitEiPKc+0x1db/0x410 [mmfs26]
[<ffffffffc09ada75>] _ZN13KernelMailbox21sendToDaemonWithReplyEv+0x375/0x410 [mmfs26]
[<ffffffffc0ada7c3>] _Z19dmGenerateDataEvent14dm_eventtype_tP15KernelOperationiP13gpfsVfsData_tP7FileUIDP10gpfsNode_txyP5ErrnoPiPxPjSC_+0x15b3/0x1af0 [mmfs26]
[<ffffffffc0a1b600>] _Z8gpfsReadP13gpfsVfsData_tP15KernelOperationP9cxiNode_tiP8cxiUio_tP9MMFSVInfoP10cxiVattr_tSA_P10ext_cred_tP14cxiPageLists_ti+0x17b0/0x3530 [mmfs26]
[<ffffffffc08a4dd4>] rdwrInternal+0x454/0x6b0 [mmfslinux]
[<ffffffffc08a50a7>] gpfs_f_read+0x77/0xc0 [mmfslinux]
[<ffffffff81200bac>] vfs_read+0x9c/0x170
[<ffffffff81201a6f>] SyS_read+0x7f/0xe0
[<ffffffff816b5212>] tracesys+0xdd/0xe2
[<ffffffffffffffff>] 0xffffffffffffffff

The Scale "mmhealth node eventlog --day" command would report the following:
...
2021-02-07 23:34:18.438490 AEDT longwaiters_found ERROR Detected Spectrum Scale long-waiters
2021-02-07 23:34:18.438490 AEDT deadlock_detected WARNING The cluster detected a Spectrum Scale filesystem deadlock
...

The Scale "mmfsadm dump dmapi" command would report a few data events accumulating for the same i-node (i.e. 251092306) over time:

...
Session ID 4c85d45d00000000, DMAPI event queue:
  Number of events on queue (total/inactive/outstanding): 25/0/25 (sync/async): 25/0 (repost): 0
  Event queue accumulative statistics:
    Total number of events added (total/sync/async): 666283/7187/659096
    reposted: 0
    moved from another queue: 0
    added as user event: 660607
    received by the DM application: 664772
    responded by the DM application: 7162
    moved to another queue: 0
  Last sync event added to queue is on Sat Dec 14 18:42:30 2020
  Last sync user event added to queue is on Wed Dec 4 09:58:15 2020
  Last async user event added to queue is on Sun Dec 15 17:36:17 2020
  Event Q element 1:
    Status: EV_VALID node <c0n6>
    Event type: READ
    Event token: 0x00000006-00000000-65679561-00000000
    Event sequence: 0x00000006-00000000-65679561-00000000
    Event node (SDR node_number): 9
    de_handle: sgId 0A01061E:5C76089A inode 251092306 gen 0xFC197D2 snap 0
    de_offset: 0x0, de_length: 32768
    Num of Access rights 0
    Version number 2
    Number of nodes with cached access rights: 0
    Added to queue on: Tue Dec 10 12:56:33 2020
    Event delivered on: Tue Dec 10 12:56:33 2020
  Event Q element 2:
    Status: EV_VALID node <c0n6>
    Event type: READ
    Event token: 0x00000006-00000000-65681C6D-00000000
    Event sequence: 0x00000006-00000000-65681C6D-00000000
    Event node (SDR node_number): 9
    de_handle: sgId 0A01061E:5C76089A inode 251092306 gen 0xFC197D2 snap 0
    de_offset: 0x0, de_length: 32768
    Num of Access rights 0
    Version number 2
.................
  Event Q element 25:
    Status: EV_VALID node <c0n8>
    Event type: READ
    Event token: 0x00000008-00000000-D0A97EA8-00000000
    Event sequence: 0x00000008-00000000-D0A97EA8-00000000
    Event node (SDR node_number): 6
    de_handle: sgId 0A01061E:5C76089A inode 251092306 gen 0xFC197D2 snap 0
    de_offset: 0x30000, de_length: 65536
    Num of Access rights 0
    Version number 2
    Number of nodes with cached access rights: 0
    Added to queue on: Sat Dec 14 18:41:08 2020
    Event delivered on: Sat Dec 14 18:41:08 2020
...

IBM Spectrum Protect Versions Affected: Space Management Client for Unix and Linux 7.1.x and 8.1.x on all supported platforms

Initial Impact: Low

Additional KEYWORDS: Spectrum Protect; HSM; TS004921636; recall; hang; dsmrecalld
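When triaging this symptom, the telltale sign in the "mmfsadm dump dmapi" output is many queued READ events carrying the same i-node number. A minimal sketch of grouping the queued events by i-node is shown below; this is an illustration written against the output format above, not an IBM-supplied tool, and the threshold is an arbitrary example value:

```python
import re
from collections import Counter

def stuck_inodes(dump_text, threshold=3):
    """Count queued DMAPI events per i-node in 'mmfsadm dump dmapi'
    output and return the i-nodes whose event count reaches the
    threshold. Matches the 'de_handle: ... inode <N> ...' lines
    shown in the dump above."""
    counts = Counter(re.findall(r"de_handle:.*?\binode (\d+)", dump_text))
    return {inode: n for inode, n in counts.items() if n >= threshold}

# Example against a fragment shaped like the dump above
# (i-node 777 is a made-up second file for contrast):
sample = """
Event Q element 1:
  de_handle: sgId 0A01061E:5C76089A inode 251092306 gen 0xFC197D2 snap 0
Event Q element 2:
  de_handle: sgId 0A01061E:5C76089A inode 251092306 gen 0xFC197D2 snap 0
Event Q element 3:
  de_handle: sgId 0A01061E:5C76089A inode 251092306 gen 0xFC197D2 snap 0
Event Q element 4:
  de_handle: sgId 0A01061E:5C76089A inode 777 gen 0x1 snap 0
"""
print(stuck_inodes(sample))  # {'251092306': 3}
```

An i-node that keeps reappearing across successive dumps, with old "Added to queue on" timestamps, matches the hang pattern described in this APAR.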
Local fix
Restart the dsmrecalld daemons on all nodes.
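Restarting dsmrecalld clears the daemons' view of pending delegated recalls, which is what unblocks the waiters. The hang mechanism itself can be modeled abstractly: a requester that waits on a delegated request with no deadline blocks forever once the remote side silently discards the request, whereas a deadline lets it detect the loss and re-drive. The following is a hypothetical toy model for illustration only, not the HSM implementation:

```python
import threading

class DelegatedRecall:
    """Toy model of a recall delegated to a remote node. If the
    remote daemon restarts, the request is silently discarded and
    'done' is never set -- the situation described in this APAR."""
    def __init__(self):
        self.done = threading.Event()

    def remote_completes(self):
        self.done.set()          # normal case: remote node replies

    def remote_discards(self):
        pass                     # remote dsmrecalld restarted: no reply ever

def wait_for_recall(recall, timeout=None):
    """With timeout=None this wait can hang forever on a discarded
    request; with a finite timeout the caller can detect the lost
    request and re-drive the recall instead of blocking."""
    if recall.done.wait(timeout):
        return "recalled"
    return "timed out; re-drive recall"

lost = DelegatedRecall()
lost.remote_discards()
print(wait_for_recall(lost, timeout=0.1))  # timed out; re-drive recall
```

The workaround in this APAR corresponds to destroying the toy model's wait state outright; the projected fix levels address the lost-request condition itself.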
Problem summary
****************************************************************
* USERS AFFECTED:                                              *
* IBM Spectrum Protect for Space Management (HSM) client       *
* versions 7.1.x and 8.1.x on all platforms                    *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See ERROR DESCRIPTION.                                       *
****************************************************************
* RECOMMENDATION:                                              *
* Apply fixing level when available. This problem is projected *
* to be fixed in HSM client levels 7.1.8.11 and 8.1.12. Note   *
* that this is subject to change at the discretion of IBM.     *
****************************************************************
Problem conclusion
After the fix, the HSM recall function will work reliably regardless of the current recall workload.
Temporary fix
Comments
APAR Information
APAR number
IT36694
Reported component name
TSM SPACE MGMT
Reported component ID
5698HSMCL
Reported release
81L
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2021-04-26
Closed date
2021-06-07
Last modified date
2021-06-07
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Modules/Macros
dsmrecal
Fix information
Fixed component name
TSM SPACE MGMT
Fixed component ID
5698HSMCL
Applicable component levels
[{"Line of Business":{"code":"LOB26","label":"Storage"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSSR2R","label":"Tivoli Storage Manager for Space Management"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"81L"}]
Document Information
Modified date:
08 June 2021