APAR status
Closed as program error.
Error description
A cloud copy job in IBM Spectrum Protect Plus gets stuck after initially making some progress. The job can involve multiple volumes; most of them are copied successfully, but one remains stuck. The job log shows the same progress message repeating indefinitely. For example, the following message keeps repeating, but the amount of data transferred does not change for many hours:

CTGGA0365, Snapshot <snapshot details> has transferred 100 GB (Last status: Transferred 100 GB)

Other jobs that run at around the same time, and even jobs that start later, continue to work as expected.

The vSnap server does not show a general hang of all zfs processes. However, the output of the command "vsnap cloud session show" indicates that one session remains stuck in ACTIVE state even though other sessions are in COMPLETED state. In the repl.log on the vSnap server, the stuck session initially logs progress messages and then stops logging anything. Listing the send/recv processes on the vSnap server with the command "ps aux | grep recv" shows that the recv process associated with the stuck session is present, but it remains in sleep state and makes no progress.

In the system log (/var/log/messages) on the vSnap server, the following message can be seen around the time that the cloud session got stuck:

kernel: INFO: task vdev_autotrim:<pid> blocked for more than 600 seconds.

MDVREGR 10.1.8 5737SPLUS
Affected versions: 10.1.9 and later
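The system-log check described above can be sketched in Python. This is a minimal illustration, not an IBM-provided tool: the function name `find_autotrim_hangs` is hypothetical, and the sample lines (timestamp, host name, PID) are made up for the example. Only the kernel message text itself is taken from the error description.

```python
import re

# Pattern for the hung-task warning quoted in the error description.
HUNG_TASK = re.compile(
    r"kernel: INFO: task vdev_autotrim:(?P<pid>\d+) "
    r"blocked for more than (?P<seconds>\d+) seconds"
)


def find_autotrim_hangs(lines):
    """Return a (pid, seconds) tuple for each vdev_autotrim hung-task warning."""
    hits = []
    for line in lines:
        match = HUNG_TASK.search(line)
        if match:
            hits.append((int(match.group("pid")), int(match.group("seconds"))))
    return hits


# Hypothetical sample lines; the timestamp, host name, and PID are invented.
sample = [
    "Apr  2 10:15:01 vsnap kernel: INFO: task vdev_autotrim:12345 "
    "blocked for more than 600 seconds.",
    "Apr  2 10:15:02 vsnap systemd[1]: Started Session 42 of user root.",
]
print(find_autotrim_hangs(sample))
```

On a vSnap server, the same function could be given an open handle on /var/log/messages instead of the sample list, because it only iterates over lines.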
Local fix
Reboot the vSnap server to clear the stuck process. In some cases, disabling the "autotrim" property on the local pool can help to avoid the hang. Use the "vsnap pool show" command to note the pool ID, then disable the property with the command "zpool set autotrim=off vpool<ID>".
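As a rough sketch of the workaround above, the following Python snippet builds the command lines for a given pool ID without running them. The helper name `autotrim_commands` is hypothetical, and the snippet assumes that pool IDs are numeric (for example, vpool1). The "zpool get autotrim" check before and after the change is a standard OpenZFS command that is not mentioned in this APAR and is included only as a suggested verification step.

```python
import shlex


def autotrim_commands(pool_id):
    """Build (but do not run) the commands that disable autotrim on a vSnap pool."""
    # Assumption: the ID reported by "vsnap pool show" is numeric, e.g. 1 -> vpool1.
    pool = f"vpool{int(pool_id)}"
    return [
        ["zpool", "get", "autotrim", pool],      # check the current value
        ["zpool", "set", "autotrim=off", pool],  # workaround from this APAR
        ["zpool", "get", "autotrim", pool],      # confirm the change
    ]


# Print the commands for review; run them manually as root on the vSnap server.
for argv in autotrim_commands(1):
    print(shlex.join(argv))
```

Printing the commands rather than executing them keeps the example safe to run anywhere and leaves the decision to change the pool property with the administrator.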
Problem summary
****************************************************************
* USERS AFFECTED:                                              *
* IBM Spectrum Protect Plus levels 10.1.9 to 10.1.13           *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Apply fixing level when available. This problem is currently *
* projected to be fixed in IBM Spectrum Protect Plus level     *
* 10.1.14. Note that this is subject to change at the          *
* discretion of IBM.                                           *
****************************************************************
Problem conclusion
The problem occurred due to a deadlock in the Linux kernel that was triggered by aborted SCSI commands. Fixes for this issue have been incorporated in RHEL kernel version 4.18.0-425.10.1.el8_7. The updated kernel has been incorporated into the vSnap installer and it will be applied when the vSnap server is freshly deployed or upgraded.
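To check whether a vSnap server is already running a kernel at or above the fixed level, the release string reported by "uname -r" can be compared with the version named above. The following is a simplified sketch: the function names `kernel_key` and `has_fix` are hypothetical, and the comparison looks only at the numeric version and build fields. It is not a full RPM version comparison.

```python
import re

# Fixed kernel level named in the problem conclusion.
FIXED_KERNEL = "4.18.0-425.10.1.el8_7"


def kernel_key(release):
    """Turn '4.18.0-425.10.1.el8_7.x86_64' into ((4, 18, 0), (425, 10, 1))."""
    match = re.match(r"(\d+(?:\.\d+)*)-(\d+(?:\.\d+)*)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    version, build = match.groups()
    return (
        tuple(int(part) for part in version.split(".")),
        tuple(int(part) for part in build.split(".")),
    )


def has_fix(release):
    """Return True if the release is at or above the fixed kernel level."""
    # Simplification: compares numeric fields only and ignores the el8_N suffix.
    return kernel_key(release) >= kernel_key(FIXED_KERNEL)


# Example release strings in the format printed by "uname -r".
for release in ("4.18.0-372.9.1.el8.x86_64", "4.18.0-425.10.1.el8_7.x86_64"):
    print(release, has_fix(release))
```

On the server itself, the string returned by `platform.release()` from the Python standard library could be passed to `has_fix` instead of a hard-coded example.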
Temporary fix
Comments
APAR Information
APAR number
IT40489
Reported component name
SP PLUS
Reported component ID
5737SPLUS
Reported release
A19
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2022-04-02
Closed date
2023-02-01
Last modified date
2023-02-01
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SP PLUS
Fixed component ID
5737SPLUS
Applicable component levels
Document Information
Modified date:
01 February 2024