IBM Support

Potential Loss of Data from RtC Compressed Volumes on 8.2.1 or 8.3.0 Software

Flashes (Alerts)


Abstract

APAR HU02064 is a software issue that can cause detected data loss on RtC (Real-time Compression) volumes, affecting a 32 KB region of compressed data.

This can occur on SVC, V9000 and V7000 gen2/gen2+ systems, running 8.2.1.0-8.2.1.6 or 8.3.0.0 software.

The issue does not affect other hardware platforms and software levels, or compressed volumes in a Data Reduction Pool.

Content

Problem description
APAR HU02064 can cause detected data loss from a 32 KB compressed data block.

If data is read from an affected region, this will cause the volume to be taken offline to prevent incorrect data being returned to the host. A volume repair operation is required to bring the volume online.
The potential data loss occurs at the point where data is written to a compressed volume.  If a low-probability combination of data pattern and timing factors occurs, then incorrectly compressed data will be stored on disk.  The problem will only be detected when the incorrect data is read.
Fix
The fix for APAR HU02064, in the 8.2.1.8 and 8.3.0.1 PTFs, will prevent any further data loss occurring from this issue.
  • Systems running 8.2.1.x with RtC compressed volumes should upgrade to the 8.2.1.8 PTF.
  • Systems running 8.3.0.0 with RtC compressed volumes should upgrade to the 8.3.0.1 PTF.
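
As a rough illustration of the upgrade guidance above, the following shell sketch maps a running code level to the PTF named in this flash. The function name target_ptf is hypothetical and is not a system CLI command; the code level string is assumed to have been read from the system by the operator.

```shell
#!/bin/sh
# Map a running code level to the PTF named in this flash.
# target_ptf is a hypothetical helper, not a system CLI command.
target_ptf() {
    case "$1" in
        8.2.1.[0-7]) echo "8.2.1.8" ;;     # 8.2.1.x below the fix level
        8.3.0.0)     echo "8.3.0.1" ;;
        *)           echo "not listed" ;;  # level not covered by this flash
    esac
}

target_ptf 8.2.1.4
```

The upgrade is only called for when the system has RtC compressed volumes; the sketch does not check for that.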

How to confirm whether any volumes are affected, and prevent loss of access to affected volumes
There is a procedure which can confirm whether any compressed volumes are affected by this issue, and also protect against volumes being taken offline when the problem is detected.
Note: this procedure is optional, and only needed if you wish to be sure that no volumes are affected.
  • For each compressed volume, the risk of being affected by this issue, if the volume has not already been taken offline, is very low. 
  • In most cases, if incorrect compressed data is written, it is then detected soon afterwards, due to the host reading the incorrect data - causing the volume to be taken offline until it is repaired.
  • If host backups have been reading the compressed volume, then it is very unlikely the volume has been affected (because the problem would already have been detected when the incorrect data block was read). 
The procedure works by using volume mirroring to add a second volume copy. All valid data from the volume will be copied to this second volume copy, converting any lost data to a medium error instead of taking the volume offline.  To follow this procedure, you will need CLI access to the system via SSH.
Upgrade to software with the fix for APAR HU02064 before beginning this procedure.
For each RtC compressed volume, perform the following steps:
1. Find out whether this volume has two copies by running the "lsvdiskcopy <volume_id>" CLI command.
  • If two copies exist, and only one is compressed, then delete the compressed copy and recreate it. This will copy the data from the uncompressed copy, which cannot be affected by this issue. Then move on to the next volume.
  • If two copies exist, and both are compressed, no action is required. If one copy becomes offline due to corruption in the future, remove that copy and create a new copy to restore redundancy.
  • If only one copy exists and it is compressed, move on to step 2.
2. Create a second volume copy using the following CLI command:
   addvdiskcopy -syncrate 100 -mdiskgrp <mdiskgrp_id> -ignoresyncerrors -autodelete <volume_id>

This action must be completed using the CLI and not the GUI. If required, the new volume copy can be created as thin-provisioned by adding "-rsize 2% -autoexpand", or compressed by adding "-rsize 2% -compressed -autoexpand".

Note that autoexpand must be enabled for thin or compressed copies, otherwise the new copy will go offline when it runs out of space.
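
The three command variants described above can be sketched as a small shell helper that prints, but does not run, the command for one volume. The helper name build_addvdiskcopy and the example IDs 0 and 9 are hypothetical; the flags are the ones quoted in this flash, and placing the extra options before the volume ID is an assumption.

```shell
#!/bin/sh
# Print (not run) the step 2 command for one volume.
# build_addvdiskcopy is a hypothetical helper; the flags are those quoted in this flash.
# usage: build_addvdiskcopy <mdiskgrp_id> <volume_id> [standard|thin|compressed]
build_addvdiskcopy() {
    pool=$1
    vol=$2
    kind=${3:-standard}
    case "$kind" in
        standard)   extra="" ;;
        thin)       extra=" -rsize 2% -autoexpand" ;;
        compressed) extra=" -rsize 2% -compressed -autoexpand" ;;
        *) echo "unknown copy type: $kind" >&2; return 1 ;;
    esac
    echo "addvdiskcopy -syncrate 100 -mdiskgrp $pool -ignoresyncerrors -autodelete$extra $vol"
}

build_addvdiskcopy 0 9 thin
```

Printing the command first makes it easy to review each line before pasting it into the SSH session.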

Notes:

  • Ensure there is enough free space in the given MDisk group (storage pool) for the new volume copy.
  • The old volume copy will automatically be deleted once the data has been copied. Either maintain a list of volumes that have been copied, or rename the volume as each new copy is created, so that you know which volumes have already been dealt with.
  • A sync rate of 100 will attempt to copy data at 64 MB/sec (depending on MDisk performance). If you are mirroring multiple volumes at once, ensure that the MDisk is capable of sustaining the total sync rate for the volumes being mirrored. You can reduce the sync rate by using the "chvdisk -syncrate X" CLI command.
  • The -ignoresyncerrors option changes the system behavior during volume copy synchronization. If data loss is detected during synchronization, the volume would normally be taken offline - but with this option, a medium error will instead appear in the eventlog. The synchronization process will skip the lost 32KB data block, and then continue copying the remainder of the volume.
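
To help with planning around the sync rate note above, here is a minimal arithmetic sketch. It assumes the copy sustains the full rate, that "MB" can be treated as MiB, and that the first argument is the capacity to be copied in GiB; both function names are hypothetical. Real synchronization times depend on MDisk performance and may be longer.

```shell
#!/bin/sh
# Rough lower bound on synchronization time, in seconds.
# usage: estimate_sync_seconds <capacity_GiB> [rate_MiB_per_sec]
# Assumes the copy sustains the full rate and that "MB" means MiB.
estimate_sync_seconds() {
    size_gib=$1
    rate=${2:-64}
    echo $(( (size_gib * 1024 + rate - 1) / rate ))   # round up
}

# Aggregate copy load when several volumes synchronize at once.
# usage: total_sync_rate <volume_count> [rate_MiB_per_sec]
total_sync_rate() {
    echo $(( $1 * ${2:-64} ))
}

estimate_sync_seconds 1024   # 1 TiB at 64 MiB/s
total_sync_rate 4            # four volumes at sync rate 100
```

Under these assumptions a 1 TiB copy takes at least 16384 seconds (roughly four and a half hours), and four simultaneous copies ask the MDisks for 256 MiB/s in total.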

3. Monitor the progress of the volume synchronization using the lsvdisksyncprogress CLI command:

> lsvdisksyncprogress
vdisk_id vdisk_name copy_id progress estimated_completion_time
9 vdisk9 1 0 191101182521

The progress will increase from 0 to 100, at which point the copy is complete, and the volume will disappear from the lsvdisksyncprogress output. The old volume copy will automatically be deleted.
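
When many volumes are synchronizing, the output can be condensed with a small filter. The sketch below assumes the five whitespace-separated columns shown in the sample output above; summarize_sync is a hypothetical helper name.

```shell
#!/bin/sh
# Summarize lsvdisksyncprogress output read from stdin.
# summarize_sync is a hypothetical helper; it assumes the five
# whitespace-separated columns shown in the sample output above.
summarize_sync() {
    # Skip the header row, then print name, copy ID and progress.
    awk 'NR > 1 && NF >= 4 { printf "%s copy %s: %s%%\n", $2, $3, $4 }'
}

summarize_sync <<'EOF'
vdisk_id vdisk_name copy_id progress estimated_completion_time
9 vdisk9 1 0 191101182521
EOF
```

In practice the input would come from the SSH session, for example by piping the command output from "ssh <user>@<system> lsvdisksyncprogress" into the filter, where <user> and <system> are placeholders for your own login and system address.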
4. Repeat steps 1-3 for each RtC compressed volume.
5. When all volumes have completed synchronization, check the eventlog for an 1840 ("MDisk bad blocks exist") error.
  • If there is an 1840 error, use the GUI "run fix" button to produce a list of volumes and LBAs where data has been lost.
  • If there is not an 1840 error, then no volumes were affected by APAR HU02064.
Contact IBM support if further assistance is required with confirming which volumes are affected.
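
For the eventlog check in step 5, a text filter along the following lines may help when working from the CLI rather than the GUI. This is only a sketch: it assumes the eventlog has been captured as colon-delimited text with a header row that names an error_code column (as a command such as "lseventlog -delim :" is assumed to produce; verify this against the CLI reference for your software level). The helper name find_1840 is hypothetical, and the sample rows are made up for illustration and are not real system output.

```shell
#!/bin/sh
# Print event-log rows whose error_code column is 1840.
# find_1840 is a hypothetical helper. It assumes colon-delimited text on
# stdin with a header row that names an "error_code" column.
# Exit status is 0 if at least one 1840 row was found, 1 otherwise.
find_1840() {
    awk -F: '
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == "error_code") col = i; next }
        col && $col == "1840" { print; found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Made-up sample input, not real system output.
find_1840 <<'EOF'
sequence_number:error_code:description
101:1840:example row
102:1234:example row
EOF
```

The filter only shows whether an 1840 error is present; the list of affected volumes and LBAs still comes from the GUI "run fix" procedure described in step 5.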


Document Information

Modified date:
28 March 2023

UID

ibm11099977