IBM Support

Potential Undetected Data Corruption during array rebuilds of RAID arrays containing NVMe drives

Flashes (Alerts)


Abstract

In the unlikely event that an NVMe drive loses a small amount of data (known as a medium error) and this medium error is discovered for the first time during an array rebuild, the array may write incorrect data to that drive instead of repairing the data.

The system continuously checks for and automatically repairs medium errors, so the likelihood of a medium error being discovered for the first time during a rebuild is very low.

If your system is affected by this issue, a 1691 error will be logged in the event log within 7 days.

This issue does not affect arrays containing SAS drives.

Content

Background Information
A drive medium error occurs if a drive has an internal failure and can no longer return valid data that had been stored on the drive. 
Array rebuilds occur under the following scenarios:
  • Full rebuilds occur after normal drive failures.
  • Partial rebuilds occur after drive upgrades.
  • Partial rebuilds occur if a drive is temporarily running slower than the rest of the RAID array.
The system uses a RAID scrub process that reads and validates all data in the RAID array every 7 days.
Behaviour of NVMe systems when detecting a drive medium error
  • If the system detects a drive medium error during normal operation (not during a rebuild), it correctly reconstructs the data and permanently repairs the medium error, so this issue cannot occur.
    • The RAID scrub will ensure that all medium errors are detected and repaired within 7 days of occurrence.
  • If the system detects a drive medium error during a rebuild, incorrect data will be written to a drive, causing undetected data corruption.
    • This condition should only occur if the medium error arose on the drive within the last 7 days and has not yet been found by the RAID scrub.
    • In this case, the RAID scrub will read that incorrect data within 7 days and log a 1691 error indicating that the array's data and parity information do not match.
    • If no 1691 errors are present in the error log, then this issue has not occurred.
The issue only affects NVMe drives, and not SAS drives.
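To illustrate what a mismatch between an array's data and parity means, the sketch below shows a generic single-parity (RAID-5 style) XOR check in Python. It is a simplified illustration under stated assumptions, not IBM's RAID or scrub implementation: the three-data-block stripe, the two-byte blocks, and the helper names `xor_blocks` and `scrub_stripe` are all hypothetical. It shows why a scrub that recomputes parity can later detect incorrect data that was written silently during a rebuild.

```python
from functools import reduce
from typing import List


def xor_blocks(blocks: List[bytes]) -> bytes:
    """XOR equal-length byte blocks together, as single-parity RAID does (illustrative)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def scrub_stripe(data_blocks: List[bytes], stored_parity: bytes) -> bool:
    """Hypothetical scrub step: recompute parity from the data and compare with stored parity."""
    return xor_blocks(data_blocks) == stored_parity


# A healthy stripe: three data blocks plus the parity written alongside them.
data = [b"\x01\x02", b"\x10\x20", b"\xa0\x0b"]
parity = xor_blocks(data)
print("healthy stripe consistent:", scrub_stripe(data, parity))

# Simulate a rebuild that silently wrote incorrect data to the second drive.
corrupted = [data[0], b"\x00\x00", data[2]]
print("corrupted stripe consistent:", scrub_stripe(corrupted, parity))
```

In this toy model the second check fails, which corresponds to the kind of data/parity inconsistency that, according to this alert, the RAID scrub reports as a 1691 error. The real system's stripe layout and parity scheme are not described here.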
Mitigation or Fix

This issue affects all software releases that support NVMe drives. It was fixed under APAR HU02186 in v8.2.1.11 and v8.3.1.2; however, subsequent changes caused this fix to no longer be effective in the specific releases v8.2.1.12 and v8.3.1.4.

The fix for APAR HU02186 is present in releases v8.2.1.13 and v8.3.1.5 (or later) which are now available on IBM Fix Central.
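The release ranges above can be expressed as a small check. This is a hypothetical helper, not an IBM tool: the function name `has_hu02186_fix`, the `FIX_STREAMS` table, and the version-parsing rules are assumptions. It covers only the v8.2.1 and v8.3.1 streams named in this alert (any other stream returns `None`) and encodes the stated ranges literally; the releases recommended above remain v8.2.1.13 and v8.3.1.5 or later.

```python
from typing import Optional, Tuple

# Hypothetical encoding of the ranges stated in this alert, per release stream:
# "fixed" is the first build containing the HU02186 fix, and "regressed" lists
# the builds in which the fix was no longer effective.
FIX_STREAMS = {
    (8, 2, 1): {"fixed": 11, "regressed": {12}},
    (8, 3, 1): {"fixed": 2, "regressed": {4}},
}


def parse_version(version: str) -> Tuple[int, ...]:
    """Turn 'v8.3.1.5' or '8.3.1.5' into (8, 3, 1, 5)."""
    return tuple(int(part) for part in version.lstrip("vV").split("."))


def has_hu02186_fix(version: str) -> Optional[bool]:
    """True if the fix is effective, False if not, None if the stream is not covered."""
    parts = parse_version(version)
    stream = parts[:3]
    build = parts[3] if len(parts) > 3 else 0
    info = FIX_STREAMS.get(stream)
    if info is None:
        return None  # release stream not listed in this alert
    return build >= info["fixed"] and build not in info["regressed"]


for release in ["v8.2.1.11", "v8.2.1.12", "v8.2.1.13", "v8.3.1.4", "v8.3.1.5", "v8.4.0.0"]:
    print(release, has_hu02186_fix(release))
```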

Affected products (all versions):
  • IBM FlashSystem 9x00
  • IBM FlashSystem 7x00
  • IBM Storwize V7000
  • IBM Storwize V5000
  • IBM FlashSystem 5000

Document Information

Modified date:
28 March 2023

UID

ibm16254742