IBM Support

IT34064: VSNAP SHOWS I/O HANGS EVEN AFTER APPLYING FIXES FOR APAR IT31841

APAR status

  • Closed as program error.

Error description

  • The IBM Spectrum Protect Plus vSnap server shows symptoms of
    I/O hangs even after applying the fixes that were previously
    made for APAR IT31841.
    
    The initial symptoms of job failures/hangs are similar to the
    symptoms described in IT31841. Running "ps aux | grep D" on the
    vSnap server shows many zfs processes that are in "D" state
    (i.e. hung or waiting for I/O).
    
    Further investigation on the vSnap shows that, in the output of
    the "vsnap system stats" command, the ARC is 100% used. This
    symptom is mainly seen while the vSnap is in the hung state and
    has not yet been rebooted. For example:
    
    ARC MAX SIZE: 85.28GB
    ARC USED SIZE: 85.30GB
    
    Another possible symptom is that stacks similar to the
    following are seen in the system log (/var/log/messages),
    although in some cases these messages may not be present:
    
    kernel: INFO: task dbuf_evict:1008 blocked for more than 600 seconds.
    kernel: Tainted: P OE 4.19.119-4c.el7.x86_64 #1
    kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    kernel: dbuf_evict D 0 1008 2 0x80000000
    kernel: Call Trace:
    kernel: ? __schedule+0x2ab/0x880
    kernel: schedule+0x32/0x80
    kernel: schedule_preempt_disabled+0xa/0x10
    kernel: __mutex_lock.isra.11+0x21b/0x4e0
    kernel: ? kmem_cache_free+0x10c/0x130
    kernel: ? cityhash4+0x78/0xa0 [zfs]
    kernel: dbuf_hash_remove.constprop.18+0x53/0xf0 [zfs]
    kernel: dbuf_destroy+0x13e/0x300 [zfs]
    kernel: dbuf_evict_one+0xe7/0x100 [zfs]
    kernel: ? dbuf_evict_one+0x100/0x100 [zfs]
    kernel: dbuf_evict_thread+0xa4/0x100 [zfs]
    kernel: thread_generic_wrapper+0x6f/0x80 [spl]
    kernel: kthread+0xf8/0x130
    kernel: ? __thread_exit+0x20/0x20 [spl]
    kernel: ? kthread_bind+0x10/0x10
    kernel: ret_from_fork+0x35/0x40
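
The three symptoms described above (processes in "D" state, a fully used ARC, and hung-task messages in the system log) could be checked with a short script. The following is only a rough sketch, not an IBM-supplied tool. It assumes that the "vsnap system stats" output contains the two ARC lines in exactly the form shown in the example, that sizes use binary units, and that process data comes from "ps -eo pid,stat,comm" rather than "ps aux | grep D" (which also matches any line that merely contains the letter D). All function names are hypothetical.

```python
import re

# Assumption: sizes are reported with binary units (1GB = 1024**3 bytes).
_UNITS = {"KB": 1024, "MB": 1024**2, "GB": 1024**3, "TB": 1024**4}


def parse_arc_sizes(stats_text):
    """Return (max_bytes, used_bytes) from 'vsnap system stats' text."""
    sizes = {}
    for label in ("MAX", "USED"):
        match = re.search(rf"ARC {label} SIZE:\s*([\d.]+)\s*([KMGT]B)", stats_text)
        if match is None:
            raise ValueError(f"ARC {label} SIZE line not found")
        sizes[label] = float(match.group(1)) * _UNITS[match.group(2)]
    return sizes["MAX"], sizes["USED"]


def arc_is_full(stats_text):
    """True when the used ARC size has reached the maximum ARC size."""
    max_bytes, used_bytes = parse_arc_sizes(stats_text)
    return used_bytes >= max_bytes


def d_state_processes(ps_text):
    """Return (pid, command) for rows whose STAT column starts with 'D'.

    Expects the output of: ps -eo pid,stat,comm
    """
    hung = []
    for line in ps_text.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) == 3 and fields[1].startswith("D"):
            hung.append((int(fields[0]), fields[2]))
    return hung


def hung_task_names(log_text):
    """Return (task, pid) for each kernel hung-task message in the log text."""
    pattern = re.compile(r"INFO: task (\S+):(\d+) blocked for more than \d+ seconds")
    return [(m.group(1), int(m.group(2))) for m in pattern.finditer(log_text)]
```

The functions take text as input, so they can be run against saved command output or a copy of /var/log/messages collected before the vSnap is rebooted.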
    

Local fix

  • Reboot the vSnap to clear the hangs and run the backups again.
    To avoid triggering the hangs, try to reduce the amount of
    concurrent I/O on the vSnap server. This can be achieved using
    one or more of the following techniques:
    
    - Modify the schedules of overlapping backup/replication jobs
      to prevent too many jobs from running at the same time.
    - Modify the schedule of the Maintenance job to make sure it
      runs during off-hours when other backup/replication jobs do
      not usually run.
    - Modify the VADP Proxy configuration by reducing the number of
      proxies or reducing the proxy softcap limit setting to
      prevent too many concurrent backup streams from running at
      the same time.
    - Modify the Concurrent Backup setting on the Advanced Options
      page under the Disk management UI. Change the setting from
      "Unlimited" to "Limit" and set a limit value of 15 to
      throttle the number of concurrent backup streams written to
      the vSnap.
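
When adjusting schedules, it can help to estimate how many jobs overlap at the busiest moment. The sketch below is a generic sweep over job time windows and does not read anything from IBM Spectrum Protect Plus. The function name and the sample windows in the usage note are hypothetical, and the start and end times would have to be gathered manually from the job history.

```python
from datetime import datetime


def peak_concurrency(windows):
    """Return the largest number of windows that overlap at any instant.

    windows: iterable of (start, end) datetime pairs with start < end.
    """
    events = []
    for start, end in windows:
        events.append((start, 1))   # a job begins
        events.append((end, -1))    # a job finishes
    # Sort by time; at equal times, finishes (-1) are processed before starts (+1),
    # so back-to-back jobs are not counted as overlapping.
    events.sort(key=lambda event: (event[0], event[1]))
    running = peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak
```

For example, four hypothetical jobs running 20:00-23:00, 21:00-22:00, 21:30-23:30 and 23:00-23:45 on the same evening give a peak of 3. Moving one of the first three to a different window would lower that peak.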
    

Problem summary

  • ****************************************************************
    * USERS AFFECTED:                                              *
    * IBM Spectrum Protect Plus levels 10.1.5 and 10.1.6           *
    ****************************************************************
    * PROBLEM DESCRIPTION:                                         *
    * See ERROR DESCRIPTION                                        *
    ****************************************************************
    * RECOMMENDATION:                                              *
    * Apply the fixing level when available. This problem is       *
    * currently projected to be fixed in IBM Spectrum Protect Plus *
    * level 10.1.7. Note that this is subject to change at the     *
    * discretion of IBM.                                           *
    ****************************************************************
    

Problem conclusion

  • The hang was caused by a deadlock in the filesystem module in
    the kernel. The deadlock is caused by a lock ordering bug in the
    code path that prefetches dedup table (DDT) blocks into memory.
    This code path was enabled prior to 10.1.5 to speed up the
    loading of the DDT and thus improve performance. Starting with
    10.1.5 patch1, further improvements were made under APAR IT31918
    which introduced a new preload mechanism for the DDT. Given the
    newer improvements, the older prefetch logic is no longer
    necessary. The deadlock has thus been resolved by disabling the
    older prefetch code path.
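
As a general illustration of a lock ordering bug (not the actual code in the filesystem module), the sketch below records the order in which each code path acquires its locks and reports whether two paths can take the same locks in conflicting order. The lock names and path contents in the usage note are hypothetical. Removing one of the conflicting paths, which is in effect what disabling the older prefetch code path does, removes the cycle.

```python
def lock_order_edges(acquisition_paths):
    """Map each lock to the set of locks acquired while it is held.

    acquisition_paths: list of lists; each inner list names the locks one
    code path acquires, in order, while still holding the earlier ones.
    """
    edges = {}
    for path in acquisition_paths:
        for i, held in enumerate(path):
            edges.setdefault(held, set()).update(path[i + 1:])
    return edges


def has_lock_order_cycle(acquisition_paths):
    """Return True if the recorded lock orders contain a cycle (possible deadlock)."""
    edges = lock_order_edges(acquisition_paths)
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS stack, finished
    color = {}

    def visit(node):
        color[node] = GREY
        for nxt in edges.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:
                return True  # back edge: two paths disagree on the order
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(edges))
```

With two hypothetical paths, one taking lock_A then lock_B and the other taking lock_B then lock_A, the check returns True. With only the first path left, it returns False.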
    

Temporary fix

Comments

APAR Information

  • APAR number

    IT34064

  • Reported component name

    SP PLUS

  • Reported component ID

    5737SPLUS

  • Reported release

    A15

  • Status

    CLOSED PER

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2020-08-31

  • Closed date

    2020-11-13

  • Last modified date

    2020-11-19

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

  • Fixed component name

    SP PLUS

  • Fixed component ID

    5737SPLUS

Applicable component levels


Document Information

Modified date:
31 January 2024