APAR status
Closed as program error.
Error description
When a cluster loses a file system manager (perhaps due to a loss of network connectivity), another manager node must take over. As part of that takeover, I/O to that file system's disks must be suspended until the takeover is complete. In some circumstances, when the disks use persistent reserve, a deadlock can occur between the file system manager takeover and the requests to fence the disks. The deadlock prevents the takeover from ever completing, and until the takeover completes, no further I/O to that file system can take place. In addition, cluster nodes cannot rejoin the cluster while the file system manager takeover is pending, so those nodes remain in the "arbitrating" state and never reach "active".
Workaround: The deadlock can be cleared by restarting Spectrum Scale on all nodes in the cluster at once, that is, by running "mmshutdown -a" followed by "mmstartup -a".
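The restart sequence described in the workaround could be scripted roughly as follows. This is a minimal sketch and not an IBM-provided tool: the function name restart_cluster and its dry_run parameter are hypothetical, and only the two commands quoted above are used. It assumes the script runs with sufficient privileges on a node where the Spectrum Scale commands are on the PATH.

```python
import subprocess

# Commands quoted in the workaround text; "-a" applies them to all nodes.
WORKAROUND_COMMANDS = [["mmshutdown", "-a"], ["mmstartup", "-a"]]


def restart_cluster(dry_run=True):
    """Return the workaround commands in order, running each unless dry_run is set.

    Hypothetical helper for illustration only.
    """
    executed = []
    for command in WORKAROUND_COMMANDS:
        if not dry_run:
            # check=True raises if mmshutdown fails, so mmstartup is not attempted.
            subprocess.run(command, check=True)
        executed.append(list(command))
    return executed


if __name__ == "__main__":
    # Dry run by default: only print what would be executed.
    for command in restart_cluster(dry_run=True):
        print(" ".join(command))
```

Keeping dry_run enabled by default is a deliberate choice here, since the restart takes Spectrum Scale down on every node in the cluster at once.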
Local fix
Problem summary
Deadlock between log recovery and file system takeover caused node to be stuck in arbitrating state
Problem conclusion
This problem is fixed in 5.1.2.10.
To see all Spectrum Scale APARs and their respective fix solutions, refer to: https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
Benefits of the solution: Fixed the code where GPFS unconditionally called disk fencing when it should not
Work Around: None
Problem trigger: File system manager takeover and log recovery happen with a disk fencing error condition, under some timing
Symptom: Hang/Deadlock/Unresponsiveness/Long Waiters
Platforms affected: ALL Operating System environments
Functional Area affected: All Scale Users
Customer Impact: High Importance
Temporary fix
Comments
APAR Information
APAR number
IJ44682
Reported component name
SPEC SCALE DME
Reported component ID
5737F34AP
Reported release
514
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2022-12-20
Closed date
2023-03-03
Last modified date
2023-03-03
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SPEC SCALE DME
Fixed component ID
5737F34AP
Applicable component levels
Business Unit: IBM Infrastructure w/TPS (BU058)
Product: STXKQY
Platform: Platform Independent (PF025)
Version: 514
Line of Business: Storage (LOB26)
Document Information
Modified date:
03 March 2023