IBM Support

IJ44682: FILE SYSTEM MGR TAKEOVER DEADLOCK, NODES ARBITRATING

APAR status

  • Closed as program error.

Error description

  • When a cluster loses a file system manager (for example,
    because of a network connectivity failure), another
    manager node must take over. As part of that takeover,
    I/O to that file system's disks must be suspended until
    the takeover is complete.
    
    In some circumstances when the disks use persistent
    reserve, a deadlock can occur between the file system
    manager takeover and the requests to fence the disks.
    
    Until the file system manager takeover completes,
    no further IO to that file system can take place.
    The deadlock prevents that takeover from ever
    completing.
    
    A further impact is that cluster nodes cannot rejoin
    the cluster while the file system manager takeover is
    pending, so those nodes remain in the "arbitrating"
    state and never reach "active".
    
    Workaround:
    
    The deadlock can be cleared by restarting Spectrum
    Scale on all nodes in the cluster at once, that is,
    "mmshutdown -a" followed by "mmstartup -a".
    
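    The workaround above can be sketched as a small shell script. This is
    only an illustration under stated assumptions: the helper names run and
    restart_cluster and the DRY_RUN variable are hypothetical and not part
    of Spectrum Scale, and the mmgetstate calls are an added convenience
    (not in the APAR text) for seeing which nodes report "arbitrating"
    before and after the restart. Because a full-cluster restart interrupts
    all file system access, the sketch defaults to printing the commands
    rather than running them.

```shell
#!/bin/sh
# Sketch of the APAR workaround: restart Spectrum Scale on all nodes at once.
# Assumption: run from a node with cluster administration rights and the
# Spectrum Scale commands on PATH. DRY_RUN, run and restart_cluster are
# hypothetical names introduced for this example.

DRY_RUN="${DRY_RUN:-1}"   # default to printing commands only

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

restart_cluster() {
    run mmgetstate -a              # record node states before the restart
    run mmshutdown -a || return 1  # stop the daemon on every node
    run mmstartup -a  || return 1  # start the daemon on every node
    run mmgetstate -a              # check whether nodes now reach "active"
}

restart_cluster
```

    Setting DRY_RUN=0 in the environment makes the script execute the
    commands instead of printing them.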

Local fix

Problem summary

  • A deadlock between log recovery and file system manager
    takeover causes nodes to be stuck in the arbitrating state.
    

Problem conclusion

  • This problem is fixed in 5.1.2.10
    To see all Spectrum Scale APARs and their respective
    Fix solutions refer to page:
    https://public.dhe.ibm.com/storage/spectrumscale/spectrum_scale_apars.html
    
    Benefits of the solution:
    Fixed the code path in which GPFS unconditionally called disk
    fencing when it should not have.
    
    Work Around:
    None
    
    Problem trigger:
    File system manager takeover and log recovery occur while a disk
    fencing error condition is present, under specific timing.
    
    Symptom:
    Hang/Deadlock/Unresponsiveness/Long Waiters
    
    Platforms affected:
    ALL Operating System environments
    
    Functional Area affected:
    All Scale Users
    
    Customer Impact:
    High Importance
    

Temporary fix

Comments

APAR Information

  • APAR number

    IJ44682

  • Reported component name

    SPEC SCALE DME

  • Reported component ID

    5737F34AP

  • Reported release

    514

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2022-12-20

  • Closed date

    2023-03-03

  • Last modified date

    2023-03-03

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SPEC SCALE DME

  • Fixed component ID

    5737F34AP

Applicable component levels

Document Information

Modified date:
03 March 2023