APAR status
Closed as program error.
Error description
The IBM Spectrum Protect Plus vSnap server shows symptoms of I/O hangs even after applying the fixes that were previously made for APAR IT31841. The initial symptoms of job failures and hangs are similar to those described in IT31841. Running "ps aux | grep D" on the vSnap server shows many zfs processes in the "D" state (i.e. hung or waiting for I/O).

Further investigation on the vSnap shows that, in the output of the "vsnap system stats" command, the ARC is 100% used. This symptom is mainly seen while the vSnap is in the hung state and has not yet been rebooted. For example:

ARC MAX SIZE: 85.28GB
ARC USED SIZE: 85.30GB

Another possible symptom is that stacks similar to the following are seen in the system log (/var/log/messages), although in some cases these messages may not be present:

kernel: INFO: task dbuf_evict:1008 blocked for more than 600 seconds.
kernel: Tainted: P OE 4.19.119-4c.el7.x86_64 #1
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kernel: dbuf_evict D 0 1008 2 0x80000000
kernel: Call Trace:
kernel: ? __schedule+0x2ab/0x880
kernel: schedule+0x32/0x80
kernel: schedule_preempt_disabled+0xa/0x10
kernel: __mutex_lock.isra.11+0x21b/0x4e0
kernel: ? kmem_cache_free+0x10c/0x130
kernel: ? cityhash4+0x78/0xa0 [zfs]
kernel: dbuf_hash_remove.constprop.18+0x53/0xf0 [zfs]
kernel: dbuf_destroy+0x13e/0x300 [zfs]
kernel: dbuf_evict_one+0xe7/0x100 [zfs]
kernel: ? dbuf_evict_one+0x100/0x100 [zfs]
kernel: dbuf_evict_thread+0xa4/0x100 [zfs]
kernel: thread_generic_wrapper+0x6f/0x80 [spl]
kernel: kthread+0xf8/0x130
kernel: ? __thread_exit+0x20/0x20 [spl]
kernel: ? kthread_bind+0x10/0x10
kernel: ret_from_fork+0x35/0x40
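The two checks described above (processes in the "D" state and ARC usage) can be scripted. The following is a minimal sketch, not an IBM-provided tool. It assumes the process list is collected with "ps -eo state,pid,comm", which filters on the state column more precisely than "grep D", and that the ARC lines look like the example in this description; the real "vsnap system stats" output may contain other fields or units. The function names are hypothetical.

```python
import re


def d_state_processes(ps_output: str) -> list[tuple[int, str]]:
    """Return (pid, command) for every process whose state starts with 'D'.

    Expects the text produced by `ps -eo state,pid,comm`.
    """
    hung = []
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0].startswith("D"):
            hung.append((int(parts[1]), parts[2]))
    return hung


# Assumption: sizes are reported in GB, as in the example in this APAR.
_ARC_SIZE = re.compile(r"ARC (MAX|USED) SIZE:\s*([\d.]+)\s*GB")


def arc_usage_ratio(stats_output: str) -> float:
    """Return ARC used size divided by ARC max size."""
    sizes = {kind: float(value) for kind, value in _ARC_SIZE.findall(stats_output)}
    return sizes["USED"] / sizes["MAX"]
```

A ratio at or above 1.0 together with a long list of zfs processes in the "D" state would match the symptoms described here. The input text can be captured with subprocess.run(["ps", "-eo", "state,pid,comm"], capture_output=True, text=True).stdout.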
Local fix
Reboot the vSnap to clear the hangs and run the backups again. To avoid triggering the hangs, try to reduce the amount of concurrent I/O on the vSnap server. This can be achieved using one or more of the following techniques:
- Modify the schedules of overlapping backup/replication jobs to prevent too many jobs from running at the same time.
- Modify the schedule of the Maintenance job to make sure it runs during off-hours when other backup/replication jobs do not usually run.
- Modify the VADP Proxy configuration by reducing the number of proxies or reducing the proxy softcap limit setting to prevent too many concurrent backup streams from running at the same time.
- Modify the Concurrent Backup setting on the Advanced Options page under the Disk management UI. Change the setting from "Unlimited" to "Limit" and set a limit value of 15 to throttle the number of concurrent backup streams written to the vSnap.
Problem summary
****************************************************************
* USERS AFFECTED:                                              *
* IBM Spectrum Protect level 10.1.5 and 10.1.6                 *
****************************************************************
* PROBLEM DESCRIPTION:                                         *
* See ERROR DESCRIPTION                                        *
****************************************************************
* RECOMMENDATION:                                              *
* Apply the fixing level when available. This problem is       *
* currently projected to be fixed in IBM Spectrum Protect Plus *
* level 10.1.7. Note that this is subject to change at the     *
* discretion of IBM.                                           *
****************************************************************
Problem conclusion
The hang was caused by a deadlock in the filesystem module in the kernel. The deadlock is caused by a lock ordering bug in the code path that prefetches dedup table (DDT) blocks into memory. This code path was enabled prior to 10.1.5 to speed up the loading of the DDT and thus improve performance. Starting with 10.1.5 patch1, further improvements were made under APAR IT31918 which introduced a new preload mechanism for the DDT. Given the newer improvements, the older prefetch logic is no longer necessary. The deadlock has thus been resolved by disabling the older prefetch code path.
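For readers unfamiliar with the term, a lock ordering bug occurs when two threads take the same pair of locks in opposite orders, so each can end up holding one lock while waiting forever for the other. The sketch below is a generic illustration only. It is not the ZFS code and not the fix applied for this APAR, which disabled the older prefetch code path. The lock names are hypothetical. It shows the usual remedy: acquire locks in one global order regardless of the order in which the caller names them.

```python
import threading
from contextlib import contextmanager


@contextmanager
def acquire_in_order(*locks):
    """Acquire all locks in one global order (by id) so that no two threads
    can each hold a lock the other is waiting for."""
    ordered = sorted(locks, key=id)
    for lock in ordered:
        lock.acquire()
    try:
        yield
    finally:
        for lock in reversed(ordered):
            lock.release()


def run_demo(iterations: int = 1000) -> int:
    """Two threads request the same pair of locks in opposite orders."""
    hash_lock = threading.Lock()  # hypothetical name, for illustration only
    buf_lock = threading.Lock()   # hypothetical name, for illustration only
    counter = {"n": 0}

    def worker(first, second):
        for _ in range(iterations):
            with acquire_in_order(first, second):
                counter["n"] += 1  # protected by both locks

    t1 = threading.Thread(target=worker, args=(hash_lock, buf_lock))
    t2 = threading.Thread(target=worker, args=(buf_lock, hash_lock))
    t1.start()
    t2.start()
    t1.join()
    t2.join()
    return counter["n"]
```

If each worker instead acquired first and then second directly, the two threads could block each other permanently, which is the general shape of the hang described above.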
Temporary fix
Comments
APAR Information
APAR number
IT34064
Reported component name
SP PLUS
Reported component ID
5737SPLUS
Reported release
A15
Status
CLOSED PER
PE
NoPE
HIPER
NoHIPER
Special Attention
NoSpecatt / Xsystem
Submitted date
2020-08-31
Closed date
2020-11-13
Last modified date
2020-11-19
APAR is sysrouted FROM one or more of the following:
APAR is sysrouted TO one or more of the following:
Fix information
Fixed component name
SP PLUS
Fixed component ID
5737SPLUS
Applicable component levels
Document Information
Modified date:
31 January 2024