IT36694: SPACE MANAGEMENT DISTRIBUTED RECALL MIGHT HANG DUE TO LONG WAITERS.


APAR status

  • Closed as program error.

Error description

  • The IBM Spectrum Protect Client for Space Management (HSM)
    distributed recall might hang due to long waiters on the nodes.

    When the distributed recall daemon delegates a recall request to
    another HSM node and a problem occurs on that node, such as a
    restart of dsmrecalld (the HSM recall daemon), then, depending on
    internal run-time conditions, the other node can lose track of the
    recall request.

    However, the distributed recall daemon does not know this and
    assumes the recall request is still pending. As a result,
    subsequent requests to access the same i-node wait forever,
    because they cannot be satisfied until the original request is
    satisfied. The original request is never satisfied because it was
    discarded on the remote node. (A simplified sketch of this race is
    included at the end of this description.)
    
    Customer/Support Diagnostics (If Applicable):

    The stack trace for the hanging process (pid: 163304) might be
    similar to the following output:
    
    cat /proc/163304/stack
    [<ffffffffc0898ee1>] cxiWaitEventWait+0x1d1/0x2f0 [mmfslinux]
    [<ffffffffc0a86610>] _ZN6ThCond12internalWaitEP16KernelSynchStatejPv+0xd0/0x260 [mmfs26]
    [<ffffffffc0a87cdb>] _ZN6ThCond5kWaitEiPKc+0x1db/0x410 [mmfs26]
    [<ffffffffc09ada75>] _ZN13KernelMailbox21sendToDaemonWithReplyEv+0x375/0x410 [mmfs26]
    [<ffffffffc0ada7c3>] _Z19dmGenerateDataEvent14dm_eventtype_tP15KernelOperationiP13gpfsVfsData_tP7FileUIDP10gpfsNode_txyP5ErrnoPiPxPjSC_+0x15b3/0x1af0 [mmfs26]
    [<ffffffffc0a1b600>] _Z8gpfsReadP13gpfsVfsData_tP15KernelOperationP9cxiNode_tiP8cxiUio_tP9MMFSVInfoP10cxiVattr_tSA_P10ext_cred_tP14cxiPageLists_ti+0x17b0/0x3530 [mmfs26]
    [<ffffffffc08a4dd4>] rdwrInternal+0x454/0x6b0 [mmfslinux]
    [<ffffffffc08a50a7>] gpfs_f_read+0x77/0xc0 [mmfslinux]
    [<ffffffff81200bac>] vfs_read+0x9c/0x170
    [<ffffffff81201a6f>] SyS_read+0x7f/0xe0
    [<ffffffff816b5212>] tracesys+0xdd/0xe2
    [<ffffffffffffffff>] 0xffffffffffffffff
    
    The Scale "mmhealth node eventlog --day" command would report the
    following:
    ...
    2021-02-07 23:34:18.438490 AEDT longwaiters_found ERROR Detected Spectrum Scale long-waiters
    2021-02-07 23:34:18.438490 AEDT deadlock_detected WARNING The cluster detected a Spectrum Scale filesystem deadlock
    ...
    
    The Scale "mmfsadm dump dmapi" command would
    report a few data events accumulating for the same i-node (i.e.
    251092306) over time:
    ...
     Session ID 4c85d45d00000000,
     DMAPI event queue:
     Number of events on queue (total/inactive/outstanding): 25/0/25
     (sync/async): 25/0
     (repost): 0
     Event queue accumulative statistics:
     Total number of events added (total/sync/async): 666283/7187/659096
     reposted: 0
     moved from another queue: 0
     added as user event: 660607
     received by the DM application: 664772
     responded by the DM application: 7162
     moved to another queue: 0
     Last sync event added to queue is on Sat Dec 14 18:42:30 2020
     Last sync user event added to queue is on Wed Dec 4 09:58:15 2020
     Last async user event added to queue is on Sun Dec 15 17:36:17 2020

     Event Q element 1:
     Status: EV_VALID
     node <c0n6>
     Event type: READ
     Event token: 0x00000006-00000000-65679561-00000000
     Event sequence: 0x00000006-00000000-65679561-00000000
     Event node (SDR node_number): 9
     de_handle: sgId 0A01061E:5C76089A inode 251092306 gen 0xFC197D2 snap 0
     de_offset: 0x0, de_length: 32768
     Num of Access rights 0
     Version number 2
     Number of nodes with cached access rights: 0
     Added to queue on: Tue Dec 10 12:56:33 2020
     Event delivered on: Tue Dec 10 12:56:33 2020

     Event Q element 2:
     Status: EV_VALID
     node <c0n6>
     Event type: READ
     Event token: 0x00000006-00000000-65681C6D-00000000
     Event sequence: 0x00000006-00000000-65681C6D-00000000
     Event node (SDR node_number): 9
     de_handle: sgId 0A01061E:5C76089A inode 251092306 gen 0xFC197D2 snap 0
     de_offset: 0x0, de_length: 32768
     Num of Access rights 0
     Version number 2
    .................

     Event Q element 25:
     Status: EV_VALID
     node <c0n8>
     Event type: READ
     Event token: 0x00000008-00000000-D0A97EA8-00000000
     Event sequence: 0x00000008-00000000-D0A97EA8-00000000
     Event node (SDR node_number): 6
     de_handle: sgId 0A01061E:5C76089A inode 251092306 gen 0xFC197D2 snap 0
     de_offset: 0x30000, de_length: 65536
     Num of Access rights 0
     Version number 2
     Number of nodes with cached access rights: 0
     Added to queue on: Sat Dec 14 18:41:08 2020
     Event delivered on: Sat Dec 14 18:41:08 2020
    ...
    
    IBM Spectrum Protect Versions Affected: Space
    Management Client for Unix and Linux 7.1.x and 8.1.x on all
    supported platforms
    
    Initial Impact: Low
    
    Additional KEYWORDS:
    Spectrum Protect; HSM; TS004921636; recall; hang; dsmrecalld
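
    The following Python sketch is a simplified, hypothetical model of
    the race described above. The class and method names are invented
    for illustration and do not correspond to the dsmrecalld source
    code; the point is only to show why a stale "pending" entry on the
    delegating node makes every later access to the same i-node wait
    forever once the remote node has discarded the request:

    # Hypothetical model only; not the HSM implementation.
    class RemoteNode:
        """Remote HSM node that executes delegated recalls."""
        def __init__(self):
            self.queue = []            # delegated requests it knows about

        def accept(self, inode):
            self.queue.append(inode)

        def restart(self):
            # A dsmrecalld restart drops the in-memory queue, so
            # already-delegated requests are silently discarded.
            self.queue = []

    class DelegatingNode:
        """Node that delegates recalls and tracks them as pending."""
        def __init__(self, remote):
            self.remote = remote
            self.pending = set()       # i-nodes believed to be in flight

        def recall(self, inode):
            if inode in self.pending:
                # Later access to the same i-node waits for the
                # original request, which no longer exists remotely.
                return "WAIT (hangs: original request was discarded)"
            self.pending.add(inode)
            self.remote.accept(inode)
            return "DELEGATED"

    remote = RemoteNode()
    local = DelegatingNode(remote)
    print(local.recall(251092306))     # DELEGATED
    remote.restart()                   # problem on the remote node
    print(local.recall(251092306))     # WAIT ... (the hang in this APAR)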
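
    When reviewing such a dump offline, it can help to count how many
    queued data events reference the same i-node. The following rough
    Python sketch assumes the dump was captured to a file, for example
    with "mmfsadm dump dmapi > dmapi.out", and that the line layout
    matches the sample above; the exact format may differ between
    Spectrum Scale versions, so treat it as a starting point rather
    than a supported tool:

    # Count queued DMAPI events per i-node from a saved dump file.
    # The regular expressions are based only on the sample output in
    # this APAR and may need adjusting for other Scale versions.
    import re
    import sys
    from collections import Counter

    counts = Counter()
    current_type = None

    with open(sys.argv[1]) as dump:
        for line in dump:
            m = re.search(r"Event type:\s+(\S+)", line)
            if m:
                current_type = m.group(1)
                continue
            m = re.search(r"de_handle:.*\binode\s+(\d+)", line)
            if m and current_type:
                counts[(m.group(1), current_type)] += 1
                current_type = None

    # Many events of the same type piling up on one i-node (for
    # example READ on 251092306 above) matches the hang pattern
    # described in this APAR.
    for (inode, ev_type), n in counts.most_common(10):
        print(f"inode {inode}: {n} queued {ev_type} event(s)")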
    

Local fix

  • Restart the dsmrecalld daemon on all nodes.
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * IBM Spectrum Protect for Space Management (HSM) client       *
    * versions 7.1.x and 8.1.x on all platforms                    *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See ERROR DESCRIPTION.                                       *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Apply fixing level when available. This problem is projected *
    * to be fixed in HSM client levels 7.1.8.11 and 8.1.12. Note   *
    * that this is subject to change at the discretion of IBM.     *
    ****************************************************************
    

Problem conclusion

  • After the fix, the HSM recall function will work reliably
    regardless of the current recall workload.
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT36694

  • Reported component name

    TSM SPACE MGMT

  • Reported component ID

    5698HSMCL

  • Reported release

    81L

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2021-04-26

  • Closed date

    2021-06-07

  • Last modified date

    2021-06-07

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Modules/Macros

  • dsmrecal
    

Fix information

  • Fixed component name

    TSM SPACE MGMT

  • Fixed component ID

    5698HSMCL

Applicable component levels

[{"Line of Business":{"code":"LOB26","label":"Storage"},"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSSR2R","label":"Tivoli Storage Manager for Space Management"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"81L"}]

Document Information

Modified date:
08 June 2021