IBM Support

Potential data loss scenario when using compressed volumes on SVC and Storwize V7000 running software versions 7.3, 7.4 or 7.5

Flashes (Alerts)


Abstract

IBM has identified a low-probability scenario in which a region of a compressed volume can experience detected data loss, affecting a 32KB region of compressed data.

Content

Problem description

SVC and Storwize V7000 systems running 7.3, 7.4 or 7.5 software are susceptible to an issue which may cause detected data loss on a region of a compressed volume (APAR HU00898).

If data is read from an affected region, this will cause the volume to be taken offline. IBM Support assistance is then needed to restore access to the volume (as the standard compressed volume repair procedure will not work).

The issue is fixed in the 7.3.0.12, 7.4.0.6 and 7.5.0.3 PTF releases.

An improved recovery procedure for this issue is introduced by APAR HU00980. This APAR is included in the 7.4.0.7 PTF release, and is expected to be included in future 7.3 and 7.5 PTF releases.

Details

It is possible for a 32KB compressed data block to be referenced by two different locations on the volume (a "double allocation"). If one of these two locations is accessed, an invalid data identifier will be detected, and the volume will be taken offline in order to prevent further corruption.

This double allocation of a compressed data block can only occur very rarely - it depends on the value of a counter which is incremented on each write to the compressed volume. Each compressed volume keeps its own counter. It may first occur when the counter reaches a value of around 536 million. It can then recur when the counter reaches around 4.8 billion writes, and then again every 4.3 billion writes after that. The problem only occurs with certain I/O patterns and timings, further reducing the probability of experiencing the issue on any given volume.
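As a rough illustration of the counter arithmetic above, the following Python sketch lists the approximate counter values at which the exposure could recur. The exact constants are an assumption: "around 536 million" and "every 4.3 billion" are consistent with 2^29 and 2^32 respectively, but this document only states the rounded figures. The function name exposure_points is hypothetical.

```python
# Assumed constants: the rounded figures in the text match these powers of two,
# but the exact values are not confirmed by this document.
FIRST_EXPOSURE = 2 ** 29   # 536,870,912 (around 536 million)
RECURRENCE = 2 ** 32       # 4,294,967,296 (around 4.3 billion)


def exposure_points(count):
    """Return the first `count` approximate write-counter values at which
    the double allocation could occur on a given compressed volume."""
    return [FIRST_EXPOSURE + n * RECURRENCE for n in range(count)]


for value in exposure_points(3):
    print(f"{value:,}")
```

Under these assumptions the second value is about 4.8 billion, which agrees with the figure quoted above.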

Volumes created on a software version that contains the fix are not exposed to the issue.

Remediation/fixes

The 7.3.0.12, 7.4.0.6 and 7.5.0.3 PTF releases contain a fix for APAR HU00898, so no new corruption of compressed volumes will occur once one of these releases is installed. However, the fix does not repair volumes that had already experienced the issue before the upgrade.

The 7.4.0.7 PTF release contains an improved recovery procedure (APAR HU00980), which reports to the user when a volume may be exposed to this issue and allows all valid data to be recovered. Future 7.3.0 and 7.5.0 PTF releases are expected to also include this enhancement.

This procedure works by creating a second volume copy. All valid data from the volume will be copied to this second volume copy. On completion, the user will be able to determine whether or not the volume suffered data loss due to this issue.

To follow this procedure, you will need CLI access to the system via SSH.

Note: if a volume has experienced this issue, there is a small chance (around 1 in 200) that it will be taken offline during step 3b of the procedure below. In this case a new "corrupt compressed volume" event will appear in the eventlog; please contact IBM Support immediately for assistance with repairing the volume and bringing it back online.

Repair procedure

(1) Install the 7.4.0.7 software
If the system is already running 7.5.0.x software, the procedure can only be carried out once a 7.5 PTF release is installed that includes this feature.

(2) Issue the 'recovervdiskbysystem -analyzecompressedvolumes' CLI command
This new CLI command causes the SVC software to create entries in the eventlog for volumes which may be affected. The command will complete within a few seconds.

    Note: this CLI command should only be issued once. As soon as the system is running a software version containing the fix for APAR HU00898, no more volumes can become affected by this issue. The command does not need to be issued again to re-check the compressed volumes.

When the CLI command has completed, view the system eventlog using the GUI or CLI. One or more events may be logged in the eventlog as follows:
      Error code: 1863
      Event ID: 986105
      Description: Compressed volume copy may require a repair procedure
      Directed Maintenance Procedure: The system has detected this volume copy could be affected by a data loss issue, APAR HU00898. Please see the following webpage for details of how to verify whether the volume copy is affected, and how to repair it.
      http://www-01.ibm.com/support/docview.wss?uid=ssg1S1005361
If no such events are logged following the recovervdiskbysystem -analyzecompressedvolumes CLI command, then no volumes are exposed to the issue, and so no further action is needed.

One event will be logged for each volume copy which may have experienced this issue. To verify whether any data has actually been lost, follow the remainder of the procedure.
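When many events are logged, the affected volume copies can be picked out of a saved event listing with a short script. The Python sketch below assumes a colon-delimited listing whose header row includes an error_code column; the column layout and the sample rows are invented for illustration and are not real system output. The helper name events_with_error_code is hypothetical.

```python
def events_with_error_code(listing, error_code, delim=":"):
    """Parse a delimited listing with a header row and return the rows
    (as dicts) whose error_code column matches `error_code`."""
    lines = [ln for ln in listing.strip().splitlines() if ln.strip()]
    header = lines[0].split(delim)
    rows = []
    for line in lines[1:]:
        # Limit the split so a delimiter inside the last column is preserved.
        fields = line.split(delim, len(header) - 1)
        row = dict(zip(header, fields))
        if row.get("error_code") == str(error_code):
            rows.append(row)
    return rows


# Hypothetical sample data (assumed column layout, not real system output).
SAMPLE = """\
sequence_number:object_type:object_id:object_name:copy_id:fixed:event_id:error_code:description
120:vdisk:9:vdisk9:0:no:986105:1863:Compressed volume copy may require a repair procedure
121:vdisk:4:vdisk4:0:no:000000:1320:Example unrelated event
"""

affected = events_with_error_code(SAMPLE, 1863)
for row in affected:
    print(row["object_id"], row["object_name"], row["copy_id"])
```

Each returned row identifies one volume copy for which step 3 should be carried out.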

(3) For each event with error code 1863, perform the following six steps:

(a) For the relevant volume, find out whether a second copy exists using the "lsvdiskcopy <volume-id>" CLI command
  • If an uncompressed copy exists and is in sync, then delete the compressed copy, and use the uncompressed copy (which cannot be affected by APAR HU00898).
    No further action is needed for this volume.
  • If two compressed copies exist and only one of them has been identified as exposed, delete the exposed copy; no further action is required for this volume. If both copies are exposed, delete or split one of the copies and continue to step 3b.
  • If only one compressed copy exists, move on to step 3b.
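The decision in step 3a can be summarised as a small function. This is a sketch of one reading of the bullets above, not part of the product: the dictionary keys (compressed, in_sync, exposed) and the function name step_3a_action are hypothetical, and combinations the bullets do not mention are reported as not covered rather than guessed.

```python
def step_3a_action(copies):
    """Suggest the step 3a action for one volume.

    copies: list of dicts with boolean keys 'compressed', 'in_sync'
    and 'exposed' (hypothetical names, one dict per volume copy).
    """
    compressed = [c for c in copies if c["compressed"]]
    uncompressed_in_sync = [
        c for c in copies if not c["compressed"] and c["in_sync"]
    ]
    if uncompressed_in_sync:
        # An in-sync uncompressed copy cannot be affected by APAR HU00898.
        return "delete the compressed copy; no further action"
    if len(copies) == 2 and len(compressed) == 2:
        exposed = [c for c in compressed if c["exposed"]]
        if len(exposed) == 1:
            return "delete the exposed copy; no further action"
        return "delete or split one copy, then continue to step 3b"
    if len(copies) == 1 and len(compressed) == 1:
        return "continue to step 3b"
    return "not covered by the bullets above"


single = [{"compressed": True, "in_sync": True, "exposed": True}]
print(step_3a_action(single))
```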

(b) Create a second volume copy using the following CLI command:

addvdiskcopy -mdiskgrp <mdiskgrp_id> -ignoresyncerrors <volume_id>


If desired, this vdiskcopy can be created as compressed or thin-provisioned.
The -ignoresyncerrors option changes the system behavior during volume copy synchronization. If data loss caused by APAR HU00898 is detected during synchronization, the volume would normally be taken offline - but with this option, a medium error will instead appear in the eventlog, and the synchronization process will skip the lost 32KB data block, and then continue synchronization.

Ensure there is enough free space in the given MDisk group (storage pool) for the new volume copy.
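The effect of the -ignoresyncerrors option described above can be pictured with a toy model. The Python sketch below is purely illustrative and does not reflect the system's internal implementation: a volume is modelled as a list of blocks, None stands for a lost 32KB block, and the function name synchronize is hypothetical.

```python
BLOCK_SIZE = 32 * 1024  # size of one compressed data block, per the text


def synchronize(source_blocks, ignore_sync_errors):
    """Toy model of volume copy synchronization.

    source_blocks: list in which None stands for a lost 32KB block.
    Returns (status, blocks_copied, medium_error_offsets).
    """
    copied = 0
    medium_errors = []
    for index, block in enumerate(source_blocks):
        if block is None:
            if not ignore_sync_errors:
                # Default behaviour: the volume is taken offline.
                return ("offline", copied, medium_errors)
            # With the option: record a medium error, skip the block, continue.
            medium_errors.append(index * BLOCK_SIZE)
            continue
        copied += 1
    return ("in_sync", copied, medium_errors)


blocks = [b"a", b"b", None, b"d"]
print(synchronize(blocks, ignore_sync_errors=False))
print(synchronize(blocks, ignore_sync_errors=True))
```

In the model, the second call completes and reports the byte offset of the skipped block, which mirrors how the medium error in the eventlog identifies the lost region.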


(c) Monitor the progress of the volume copy synchronization using the lsvdisksyncprogress CLI command

# lsvdisksyncprogress
vdisk_id vdisk_name copy_id progress estimated_completion_time
9 vdisk9 1 0 150921182521

The progress will increase from 0 to 100, at which point the copy is complete, and the volume will disappear from the lsvdisksyncprogress output.
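If the progress output is captured to a file or variable, it can be parsed as follows. This Python sketch uses the sample output shown above and assumes that estimated_completion_time is encoded as YYMMDDHHMMSS, which matches the sample value but is not stated in this document. The function name parse_sync_progress is hypothetical.

```python
from datetime import datetime


def parse_sync_progress(output):
    """Parse whitespace-delimited lsvdisksyncprogress output into dicts."""
    lines = [ln for ln in output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        row["progress"] = int(row["progress"])
        eta = row.get("estimated_completion_time")
        # Assumed timestamp format: YYMMDDHHMMSS.
        row["eta"] = datetime.strptime(eta, "%y%m%d%H%M%S") if eta else None
        rows.append(row)
    return rows


SYNC_SAMPLE = """\
vdisk_id vdisk_name copy_id progress estimated_completion_time
9 vdisk9 1 0 150921182521
"""

for entry in parse_sync_progress(SYNC_SAMPLE):
    print(entry["vdisk_name"], entry["progress"], entry["eta"])
```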
Verify that the copy has completed by checking that lsvdiskcopy shows both copies in sync:


# lsvdiskcopy 9
vdisk_id vdisk_name copy_id status sync primary mdisk_grp_id mdisk_grp_name capacity type se_copy easy_tier easy_tier_status compressed_copy
9 vdisk9 0 online yes yes 0 mdiskgrp0 100.00MB striped no on balanced no
9 vdisk9 1 online yes no 0 mdiskgrp0 100.00MB striped no on balanced no
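The same check can be scripted against captured lsvdiskcopy output. The Python sketch below parses the sample shown above by splitting on whitespace, which assumes that object names contain no spaces; the function name all_copies_in_sync is hypothetical.

```python
def all_copies_in_sync(output):
    """Return True if the listing shows at least two copies and every
    copy is online with sync == 'yes'."""
    lines = [ln for ln in output.strip().splitlines() if ln.strip()]
    header = lines[0].split()
    copies = [dict(zip(header, ln.split())) for ln in lines[1:]]
    return len(copies) >= 2 and all(
        c["status"] == "online" and c["sync"] == "yes" for c in copies
    )


COPY_SAMPLE = """\
vdisk_id vdisk_name copy_id status sync primary mdisk_grp_id mdisk_grp_name capacity type se_copy easy_tier easy_tier_status compressed_copy
9 vdisk9 0 online yes yes 0 mdiskgrp0 100.00MB striped no on balanced no
9 vdisk9 1 online yes no 0 mdiskgrp0 100.00MB striped no on balanced no
"""

print(all_copies_in_sync(COPY_SAMPLE))
```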


(d) Check the eventlog to find out whether any medium errors (error code 1320) were reported for the volume copy.
If they were, the event sense data indicates the location of the data that was lost. Continue to 3e.

If no medium errors are reported, then no data has been lost from the volume. In this case, no further action is required, and the user can choose to keep both copies or either one of them.

(e) Split the original volume copy to become a separate volume (and keep a note of the new volume ID)

# splitvdiskcopy -copy <original_copy_id> -name <to_delete_volume_name> <volume_id>
Virtual Disk, id [12], successfully created


Once it has been verified that the volume is working correctly, the split volume (<to_delete_volume_name>) can be deleted (in this case, using "rmvdisk 12"). 

The split volume should be named appropriately, to indicate it should be deleted after verification of the new volume copy.

(f) Use the GUI to mark error code 1863 as fixed for this volume.

Repeat this procedure for each volume that has reported error code 1863.


Document Information

Modified date:
28 March 2023

UID

ssg1S1005361