IBM Support

IT40489: SINGLE CLOUD COPY SESSION REMAINS STUCK EVEN WHEN OTHER JOBS COMPLETE

APAR status

  • Closed as program error.

Error description

  • A cloud copy job in IBM Spectrum Protect Plus gets stuck
    after initially making some progress. The job can involve
    multiple volumes, most of which are copied successfully
    while one remains stuck. The job log repeats the same
    progress message indefinitely. For example, the following
    message keeps repeating, but the amount of data transferred
    does not change for many hours:
    
    CTGGA0365, Snapshot <snapshot details> has transferred
        100 GB (Last status: Transferred 100 GB)
    
    Other jobs that run around the same time, and even jobs
    that start later, continue to work as expected. The vSnap
    server does not show a general hang of all zfs processes.
    However, the output of the command "vsnap cloud session
    show" indicates that one session remains stuck in the
    ACTIVE state even though other sessions are in the
    COMPLETED state.
    
    In the repl.log on the vSnap server, the stuck session
    initially logs progress messages and then stops logging
    anything. Listing the send/recv processes on the vSnap
    server with the command "ps aux | grep recv" shows that
    the recv process associated with the stuck session is
    present, but it remains in the sleep state and makes no
    progress.
    
    In the system log (/var/log/messages) on the vSnap server,
    the following can be seen around the time that the cloud
    session got stuck:
    
    kernel: INFO: task vdev_autotrim:<pid> blocked for more
        than 600 seconds.
    
    MDVREGR 10.1.8 5737SPLUS
    Affected versions: 10.1.9 and later
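
The two log symptoms described above (a CTGGA0365 progress message that repeats with an unchanged amount, and a kernel hung-task message) can be checked for with a short script. The following is a minimal sketch, not an IBM-provided tool. It assumes each message appears on a single line in the log being scanned, and the function names, the regular expressions, and the `min_repeats` threshold are illustrative choices modelled only on the sample lines quoted in this APAR.

```python
import re
from collections import defaultdict

# Modelled on the CTGGA0365 sample in this APAR (assumption: one message per line).
PROGRESS_RE = re.compile(
    r"CTGGA0365,?\s+Snapshot (.+?) has transferred\s+([\d.]+)\s*([A-Za-z]+)"
)
# Modelled on the kernel hung-task line quoted from /var/log/messages.
HUNG_TASK_RE = re.compile(
    r"INFO: task (.+?):(\d+) blocked for more than (\d+) seconds"
)


def stalled_snapshots(log_lines, min_repeats=3):
    """Return snapshots whose last `min_repeats` progress messages are identical."""
    history = defaultdict(list)
    for line in log_lines:
        match = PROGRESS_RE.search(line)
        if match:
            snapshot, amount, unit = match.groups()
            history[snapshot].append((amount, unit))
    stalled = set()
    for snapshot, amounts in history.items():
        tail = amounts[-min_repeats:]
        # Stalled only if enough messages exist and they all report the same amount.
        if len(tail) == min_repeats and len(set(tail)) == 1:
            stalled.add(snapshot)
    return stalled


def hung_tasks(syslog_lines):
    """Return (task name, pid, seconds) for each kernel hung-task message."""
    found = []
    for line in syslog_lines:
        match = HUNG_TASK_RE.search(line)
        if match:
            found.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return found
```

A snapshot reported by `stalled_snapshots` together with a `vdev_autotrim` entry from `hung_tasks` around the same time would be consistent with the symptoms in this APAR, although neither check alone confirms it.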
    

Local fix

  • Reboot the vSnap server to clear the stuck process. In some
    cases, disabling the "autotrim" property on the local pool
    can help avoid the hang. Use the "vsnap pool show" command
    to note the pool ID, then disable the property with the
    command "zpool set autotrim=off vpool<ID>".
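
If the workaround is scripted, the command can be assembled from the pool ID rather than typed by hand. This is a minimal sketch: the helper name `autotrim_off_command` is hypothetical, and it assumes the pool ID noted from "vsnap pool show" is a plain number, so that the pool name takes the form vpool<ID> as given above. It only builds the argument list and does not run anything.

```python
def autotrim_off_command(pool_id):
    """Build the argument list for 'zpool set autotrim=off vpool<ID>'."""
    text = str(pool_id).strip()
    # Assumption: pool IDs are plain ASCII digits; reject anything else.
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"unexpected pool ID: {pool_id!r}")
    return ["zpool", "set", "autotrim=off", f"vpool{text}"]
```

On the vSnap server, the returned list could be passed to `subprocess.run` by a user with sufficient privileges; passing a list instead of a shell string avoids shell interpretation of the ID.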
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * IBM Spectrum Protect Plus levels 10.1.9 to 10.1.13           *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See Error Description                                        *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Apply fixing level when available. This problem is currently *
    * projected to be fixed in IBM Spectrum Protect Plus level     *
    * 10.1.14. Note that this is subject to change at the          *
    * discretion of IBM.                                           *
    ****************************************************************
    

Problem conclusion

  • The problem was caused by a deadlock in the Linux kernel that
    was triggered by aborted SCSI commands. Fixes for this issue
    are included in RHEL kernel version 4.18.0-425.10.1.el8_7.
    The updated kernel is included in the vSnap installer and is
    applied when the vSnap server is freshly deployed or upgraded.
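
To check whether a vSnap server is already running a kernel at or above the fixed level, the kernel release string (as printed by "uname -r") can be compared with 4.18.0-425.10.1.el8_7. The sketch below is a simplified comparison, not a full RPM version comparison: it compares only the numeric fields that precede the ".el" distribution tag, and it assumes a RHEL 8 kernel release string. The function names are illustrative.

```python
import re

FIXED_KERNEL = "4.18.0-425.10.1.el8_7"


def kernel_key(release):
    """Return the numeric fields before the '.el' tag as a tuple of ints."""
    head = release.split(".el", 1)[0]
    return tuple(int(number) for number in re.findall(r"\d+", head))


def has_fix(release, fixed=FIXED_KERNEL):
    """Return True if `release` is at or above the fixed kernel level."""
    return kernel_key(release) >= kernel_key(fixed)
```

For example, `has_fix(platform.release())` with Python's standard `platform` module would evaluate the running kernel on the server where the script is executed.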
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT40489

  • Reported component name

    SP PLUS

  • Reported component ID

    5737SPLUS

  • Reported release

    A19

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2022-04-02

  • Closed date

    2023-02-01

  • Last modified date

    2023-02-01

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SP PLUS

  • Fixed component ID

    5737SPLUS

Applicable component levels

  • Product: IBM Spectrum Protect Plus (SSNQFQ)
  • Version: A19
  • Platform: Platform Independent
  • Line of Business: Storage
  • Business Unit: IBM Infrastructure w/TPS

Document Information

Modified date:
01 February 2024