APAR status
Closed as program error.
Error description
The vSnap server can experience I/O hangs due to deadlocks in the kernel and filesystem modules. The problem can be seen initially with one or more of the following symptoms:
- Under the Disk screen of the SPP UI, the vSnap shows 'Ready' or 'Offline' status
- VM backup operations appear to hang/stall for hours
- VMware backups fail while trying to create or update NFS shares with errors indicating timeout of share commands on the vSnap server
- SQL backups fail while trying to map LUNs to the application server with errors indicating timeout of 'vsnap_targetcli' commands
- SQL backups fail with errors seen in the job log: "The system cannot find the file specified, The device is not ready."

The root cause of all these symptoms is that I/O operations on the vSnap server are hanging. The problem can be confirmed by further investigation of the vSnap server.

Running "ps aux | grep D" on the vSnap server shows many zfs processes that are in "D" state (i.e. hung or waiting for I/O).

Further investigation of the system log (/var/log/messages) on the vSnap server shows errors indicating hung processes with call traces similar to the following:

  kernel: Call Trace:
  kernel: ? __schedule+0x2ab/0x880
  kernel: schedule+0x32/0x80
  kernel: schedule_preempt_disabled+0xa/0x10
  kernel: __mutex_lock.isra.11+0x21b/0x4e0
  kernel: ? cityhash4+0x78/0xa0 [zfs]
  kernel: dbuf_find+0xb8/0x190 [zfs]
  kernel: dbuf_hold_impl+0x62/0x590 [zfs]
  kernel: dbuf_hold_level+0x33/0x60 [zfs]
  kernel: dmu_tx_check_ioerr+0x32/0xc0 [zfs]
  kernel: dmu_tx_count_write+0xdd/0x190 [zfs]
  kernel: dmu_tx_hold_write_by_dnode+0x35/0x50 [zfs]
  kernel: zfs_write+0x516/0xcd0 [zfs]
  kernel: zpl_write_common_iovec+0xa9/0x120 [zfs]
  kernel: zpl_iter_write_common+0x98/0xc0 [zfs]
  kernel: zpl_iter_write+0x3f/0x70 [zfs]

OR

  kernel: z_wr_iss D 0 18751 2 0x80000080
  kernel: Call Trace:
  kernel: ? __schedule+0x2ab/0x880
  kernel: schedule+0x32/0x80
  kernel: schedule_preempt_disabled+0xa/0x10
  kernel: __mutex_lock.isra.11+0x21b/0x4e0
  kernel: ? cityhash4+0x78/0xa0 [zfs]
  kernel: dbuf_find+0x5a/0x190 [zfs]
  kernel: dbuf_hold_impl+0x62/0x590 [zfs]
  kernel: dbuf_hold_level+0x33/0x60 [zfs]
  kernel: dmu_buf_hold_noread+0x7c/0x100 [zfs]
  kernel: dmu_buf_hold+0x37/0x80 [zfs]
  kernel: zap_lockdir+0x4e/0xc0 [zfs]
  kernel: ? _cond_resched+0x15/0x30
  kernel: ? __kmalloc_node+0x209/0x270
  kernel: zap_length_uint64+0x51/0x100 [zfs]
  kernel: ddt_zap_lookup+0x62/0xe0 [zfs]
  kernel: ? spl_kmem_cache_alloc+0x91/0x110 [spl]
  kernel: ddt_lookup+0xce/0x1a0 [zfs]
  kernel: ? abd_checksum_SHA256+0x5e/0xb0 [zfs]
  kernel: ? zio_checksum_compute+0x24c/0x3b0 [zfs]
  kernel: zio_ddt_write+0x7a/0x530 [zfs]

IBM Spectrum Protect Plus Versions Affected:
IBM Spectrum Protect Plus 10.1.x

Initial Impact: Medium

Additional Keywords: SPP, SPPlus, TS003424236
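The two confirmation steps above can be sketched as a small script. This is an illustrative helper, not an IBM-supplied tool: the function names `d_state_processes` and `zfs_mutex_traces` are hypothetical, the sample process list is made up, and the parser assumes the process list was captured with `ps -eo pid,stat,comm` (three columns) rather than `ps aux`. Filtering on the STAT column is stricter than `grep D`, which also matches any line that merely contains the letter "D".

```python
import re

# One stack frame as it appears after "kernel: ", e.g.
# "? cityhash4+0x78/0xa0 [zfs]" or "schedule+0x32/0x80".
FRAME = re.compile(r"^(?:\? )?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+(?: \[(\w+)\])?$")


def d_state_processes(ps_output):
    """Return (pid, stat, command) for rows whose STAT column starts with 'D'.

    Assumes three-column output such as that of `ps -eo pid,stat,comm`.
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[1].startswith("D"):
            rows.append((int(parts[0]), parts[1], parts[2]))
    return rows


def zfs_mutex_traces(log_lines):
    """Collect call traces that wait on a mutex and include zfs module frames.

    Each returned trace is a list of (function, module_or_None) tuples.
    """
    traces, current = [], None
    for line in log_lines:
        # Drop any timestamp/host prefix in front of "kernel: ".
        msg = line.partition("kernel: ")[2].strip()
        if msg == "Call Trace:":
            if current:
                traces.append(current)
            current = []
            continue
        match = FRAME.match(msg)
        if current is not None and match:
            current.append((match.group(1), match.group(2)))
        elif current is not None:
            # A non-frame line ends the trace being collected.
            traces.append(current)
            current = None
    if current:
        traces.append(current)
    return [
        trace for trace in traces
        if any(func.startswith("__mutex_lock") for func, _ in trace)
        and any(module == "zfs" for _, module in trace)
    ]


# Made-up process list for illustration only.
SAMPLE_PS = """
  PID STAT COMMAND
    1 Ss   systemd
18751 D    z_wr_iss
18760 D<   zfs
  900 S    sshd
"""

# Frames taken from the first trace quoted in this APAR.
SAMPLE_LOG = [
    "kernel: Call Trace:",
    "kernel: ? __schedule+0x2ab/0x880",
    "kernel: schedule+0x32/0x80",
    "kernel: __mutex_lock.isra.11+0x21b/0x4e0",
    "kernel: ? cityhash4+0x78/0xa0 [zfs]",
    "kernel: dbuf_find+0xb8/0x190 [zfs]",
]

if __name__ == "__main__":
    print(d_state_processes(SAMPLE_PS))
    print(len(zfs_mutex_traces(SAMPLE_LOG)))
```

On a live system, the input would be the captured `ps` output and the lines of /var/log/messages. A match only suggests that the symptoms described here are present; it does not by itself prove this APAR is the cause.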
Local fix
Reboot the vSnap to clear the hangs and run the backups again.

To avoid triggering the hangs, try to reduce the amount of concurrent I/O on the vSnap server. This can be achieved using one or more of the following techniques:
- Modify schedules of overlapping backup/replication jobs to prevent too many jobs from running at the same time
- Modify the schedule of the Maintenance job to make sure it runs during off-hours when other backup/replication jobs do not usually run
- Modify the VADP Proxy configuration by reducing the number of proxies or reducing the proxy softcap limit setting to prevent too many concurrent backup streams from running at the same time
- Modify the Concurrent Backup setting on the Advanced Options page under the Disk management UI. Change the setting from "Unlimited" to "Limit" and set a limit value of 15 to throttle the number of concurrent backup streams written to the vSnap
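When reviewing schedules for overlap, it can help to estimate the peak number of backup streams that would be active at once and compare it with the limit of 15 mentioned above. The following is a minimal planning sketch under stated assumptions: the function `peak_streams`, the job windows, and the per-job stream counts are all hypothetical inputs gathered by hand, not values read from Spectrum Protect Plus.

```python
def peak_streams(windows):
    """Return the largest number of streams active at the same time.

    `windows` is a list of (start, end, streams) tuples, where start and end
    are comparable times (for example, hours since midnight) and `streams`
    is the assumed number of concurrent streams the job uses.
    """
    events = []
    for start, end, streams in windows:
        events.append((start, streams))   # job begins: streams are added
        events.append((end, -streams))    # job ends: streams are released
    # At equal times, negative deltas sort first, so a job that ends exactly
    # when another starts is not counted as overlapping with it.
    events.sort()
    peak = running = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# Hypothetical schedule: (start hour, end hour, assumed streams).
SCHEDULE = [
    (20, 23, 8),   # VMware backup
    (22, 24, 10),  # SQL backup
    (23, 25, 4),   # replication, running past midnight
]
STREAM_LIMIT = 15  # limit value suggested in this APAR

if __name__ == "__main__":
    peak = peak_streams(SCHEDULE)
    print("peak streams:", peak, "over limit:", peak > STREAM_LIMIT)
```

If the estimated peak exceeds the limit, shifting one job's start time and rerunning the estimate shows whether the change is enough.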
Problem summary
****************************************************************
* USERS AFFECTED:                                              *
* IBM Spectrum Protect Plus levels 10.1.5 and 10.1.6.          *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See Error Description                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Apply fixing level when available. This problem is currently *
* projected to be fixed in IBM Spectrum Protect Plus level     *
* 10.1.6.ifix3 and 10.1.7. Note that this is subject to change *
* at the discretion of IBM.                                    *
****************************************************************
Problem conclusion
Multiple fixes were made in the filesystem modules to address hangs/deadlocks:
- Fixed a race condition between destroying snapshots and regular I/O which caused hangs when maintenance operations overlapped with backups.
- Fixed an issue with open NFS file descriptors not getting unlinked correctly.
- Fixed a bug with filesystem commands deadlocking due to bad ordering of read/write locks.
- Fixed a minor issue with filesystem memory cache contention that caused the shrinker algorithm to block I/O for long periods of time. The shrinker algorithm is called by the operating system to drop cached information from RAM to free up space. In some cases the algorithm could block I/O on the entire system for long periods of time.

It was also found that the hangs are more likely to occur when the virtual vSnap server is under CPU or memory pressure at the hypervisor level. To alleviate this problem, it is recommended that the vSnap virtual machine be configured with sufficient reserved CPU and memory.
Temporary fix
Comments
APAR Information
APAR number
IT31841
Reported component name
SP PLUS
Reported component ID
5737SPLUS
Reported release
A10
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2020-03-05
Closed date
2020-08-31
Last modified date
2020-08-31
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SP PLUS
Fixed component ID
5737SPLUS
Applicable component levels
[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSNQFQ","label":"IBM Spectrum Protect Plus"},"Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"A10","Line of Business":{"code":"LOB26","label":"Storage"}}]
Document Information
Modified date:
30 January 2024