IJ44682: FILE SYSTEM MGR TAKEOVER DEADLOCK, NODES ARBITRATING

APAR status

Closed as program error.

Error description

When a cluster loses a file system manager (for example,
because of a loss of network connectivity), another manager
node must take over. As part of that takeover, I/O to that
file system's disks must be suspended until the takeover
is complete.

In some circumstances when the disks use persistent
reserve, a deadlock can occur between the file system
manager takeover and the requests to fence the disks.

Until the file system manager takeover completes,
no further IO to that file system can take place.
The deadlock prevents that takeover from ever
completing.

Another impact is that cluster nodes cannot rejoin the
cluster while the file system manager takeover is pending.
Those nodes remain in the "arbitrating" state and never
reach "active".
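
The symptom above can be checked for by scanning node state output
for nodes that are stuck in "arbitrating". The following is a minimal
sketch, not part of the APAR. It assumes the state report (for
example, from "mmgetstate -a") has data rows of the form node number,
node name, state. The helper name arbitrating_nodes and the sample
text are hypothetical.

```python
def arbitrating_nodes(state_text):
    """Return the names of nodes whose state column reads 'arbitrating'.

    Assumption: each data row is 'node-number  node-name  state'.
    Header and separator rows are skipped because their first field
    is not a number.
    """
    nodes = []
    for line in state_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit() and fields[2] == "arbitrating":
            nodes.append(fields[1])
    return nodes


# Hypothetical sample text, used only to exercise the parser.
SAMPLE = """\
 Node number  Node name  GPFS state
-------------------------------------
       1      nodeA      active
       2      nodeB      arbitrating
       3      nodeC      arbitrating
"""

if __name__ == "__main__":
    print(arbitrating_nodes(SAMPLE))
```

Nodes that stay in this list across repeated checks, together with
long waiters, would be consistent with the deadlock described here.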

Work around:

The deadlock can be cleared by restarting Spectrum
Scale on all nodes in the cluster at once, meaning a
"mmshutdown -a" followed by a "mmstartup -a".
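
The order of the two commands in the workaround matters, so it may
help to script them. The following is a minimal sketch, not an IBM
tool. The function name restart_cluster and the dry_run flag are
hypothetical. By default it only prints what it would run.

```python
import subprocess

# Order taken from the workaround text: stop Spectrum Scale on all
# nodes first, then start it on all nodes.
RESTART_COMMANDS = [["mmshutdown", "-a"], ["mmstartup", "-a"]]


def restart_cluster(dry_run=True):
    """Run (or, by default, only print) the cluster-wide restart.

    Returns the command lines in the order they are issued.
    """
    for cmd in RESTART_COMMANDS:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True stops the sequence if a command fails, so
            # mmstartup is not issued after a failed mmshutdown.
            subprocess.run(cmd, check=True)
    return [" ".join(cmd) for cmd in RESTART_COMMANDS]


if __name__ == "__main__":
    restart_cluster()  # dry run only
```

Calling restart_cluster(dry_run=False) would issue the commands for
real, which takes the file systems offline on every node, so it
should be run only from an administration node during an agreed
outage.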

Local fix

Problem summary

A deadlock between log recovery and file system manager
takeover caused nodes to be stuck in the arbitrating state.

Problem conclusion

This problem is fixed in 5.1.2.10.
To see all Spectrum Scale APARs and their respective
fix solutions, refer to:
https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
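
To check whether a given level already includes the fix, the level
can be compared numerically with 5.1.2.10. The following is a minimal
sketch. The helper names are hypothetical. The APAR names only
5.1.2.10, so levels outside the 5.1.2 stream are reported as unknown
rather than guessed.

```python
FIXED_LEVEL = (5, 1, 2, 10)  # from "This problem is fixed in 5.1.2.10"


def parse_level(text):
    """Turn a dotted level such as '5.1.2.9' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_fix_in_stream(level_text):
    """Return True or False for 5.1.2.x levels, None for other streams.

    Comparing tuples of ints avoids the string-comparison trap in
    which '5.1.2.9' would sort after '5.1.2.10'.
    """
    level = parse_level(level_text)
    if level[:3] != FIXED_LEVEL[:3]:
        return None  # not stated in this APAR; check the APAR list
    return level >= FIXED_LEVEL


if __name__ == "__main__":
    for candidate in ("5.1.2.9", "5.1.2.10", "5.1.4.0"):
        print(candidate, has_fix_in_stream(candidate))
```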

Benefits of the solution:
Fixed the code path in which GPFS called disk fencing
unconditionally, including in cases where it should not.

Work Around:
None

Problem trigger:
File system manager takeover and log recovery happen while a
disk fencing error condition is present, under certain timing

Symptom:
Hang/Deadlock/Unresponsiveness/Long Waiters

Platforms affected:
ALL Operating System environments

Functional Area affected:
All Scale Users

Customer Impact:
High Importance

Temporary fix

Comments

APAR Information

APAR number: IJ44682
Reported component name: SPEC SCALE DME
Reported component ID: 5737F34AP
Reported release: 514
Status: CLOSED PER
PE: NoPE
HIPER: NoHIPER
Special Attention: NoSpecatt / Xsystem
Submitted date: 2022-12-20
Closed date: 2023-03-03
Last modified date: 2023-03-03

APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:

Fix information

Fixed component name: SPEC SCALE DME
Fixed component ID: 5737F34AP

Applicable component levels

Business Unit: IBM Infrastructure w/TPS (BU058)
Product: STXKQY
Platform: Platform Independent (PF025)
Version: 514
Line of Business: Storage (LOB26)

Document Information

Modified date:
03 March 2023
