Incorrect data on logical drive due to media handling error

Troubleshooting

Problem

CONSISTENCY CHECK/BACKGROUND INITIALIZATION: Impact on a RAID-1: MegaRAID reads LBA (Logical Block Address) X from drive #1 and receives incorrect data due to this root cause. MegaRAID reads LBA X from the mirror drive and receives correct data. CC/BGI reports the inconsistency and corrects it by writing the incorrect data from drive #1 to the mirror drive. The mirror drive now has incorrect data. Impact on a RAID-5/6: For RAID-5 or RAID-6, parity calculation is used to reconcile the data. This prevents these RAID levels from seeing the issue except under the following conditions: for RAID-5, two media errors must exist at the same LBA on different drives; for RAID-6, three drives must contain media errors at the same LBA.

Resolving The Problem

Source

RETAIN tip: H194505

Symptom

CONSISTENCY CHECK/BACKGROUND INITIALIZATION:

Impact on a RAID-1:

	MegaRAID reads LBA (Logical Block Address) X from drive #1 and receives incorrect data due to this root cause. MegaRAID reads LBA X from the mirror drive and receives correct data. CC/BGI reports the inconsistency and corrects it by writing the incorrect data from drive #1 to the mirror drive. The mirror drive now has incorrect data.
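The RAID-1 behavior described above can be illustrated with a small sketch. This is a simplified model, not the MegaRAID firmware: the function name `consistency_check_raid1` and the list-of-blocks representation are assumptions made for illustration, and the model treats the data returned by a read as if it were the drive's contents.

```python
def consistency_check_raid1(drive1_reads, mirror):
    """Hypothetical model of CC/BGI on a two-drive mirror.

    drive1_reads holds what each read of drive #1 returns; mirror holds the
    mirror drive's blocks. On a mismatch, the data read from drive #1 wins.
    """
    corrected_lbas = []
    for lba, (from_drive1, from_mirror) in enumerate(zip(drive1_reads, mirror)):
        if from_drive1 != from_mirror:
            mirror[lba] = from_drive1  # overwrite the mirror with drive #1's data
            corrected_lbas.append(lba)
    return corrected_lbas


good = [b"A", b"B", b"C"]
drive1_reads = list(good)
mirror = list(good)
drive1_reads[1] = b"?"  # LBA 1 returns undetermined data due to the root cause

corrected = consistency_check_raid1(drive1_reads, mirror)
print(corrected)             # LBAs that were "corrected"
print(mirror[1] == good[1])  # False: the good copy on the mirror was overwritten
```

The point of the sketch is that a mismatch alone does not tell the check which copy is correct, so a silently wrong read on drive #1 propagates to the mirror.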

Impact on a RAID-5/6:

	For RAID-5 or RAID-6, parity calculation is used to reconcile the data. This prevents these RAID levels from seeing the issue except under the following conditions: for RAID-5, two media errors must exist at the same LBA on different drives; for RAID-6, three drives must contain media errors at the same LBA.
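To see why a second error at the same LBA defeats RAID-5, consider standard XOR parity. The tip does not describe how the firmware computes parity, so the following is only a generic sketch of single-parity reconstruction; the helper `xor_blocks` and the sample byte values are illustrative assumptions. RAID-6, which keeps a second independent syndrome, is not modeled.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# One stripe at a given LBA: three data blocks plus one parity block.
data = [b"\x11\x22", b"\x33\x44", b"\x55\x66"]
parity = xor_blocks(data)

# One unreadable block: XOR of the surviving blocks and parity restores it.
rebuilt = xor_blocks([data[0], data[2], parity])
print(rebuilt == data[1])  # True

# Two unreadable blocks at the same LBA: the survivors only yield the XOR of
# the two missing blocks, not the individual blocks themselves.
combined = xor_blocks([data[2], parity])
print(combined == xor_blocks([data[0], data[1]]))  # True
```

With one parity equation per stripe, only one unknown block can be solved for, which matches the two-error condition stated for RAID-5.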

PATROL READ:

Impact on a RAID-1:

	Two scenarios are possible:

	1. MegaRAID reads LBA X from drive #1 and receives incorrect data due to this root cause. Because Patrol Read does not see a medium error, it continues without performing any correction. There is no data integrity issue in this case.
	2. MegaRAID reads LBA X from drive #1 and gets a medium error. MegaRAID reads LBA X from the mirror drive, hits this root cause, and receives incorrect data. Patrol Read recovers the medium error by writing the incorrect data from the mirror drive to drive #1. Drive #1 now has incorrect data.
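The two Patrol Read scenarios can be sketched as follows. This is a simplified model under stated assumptions: `MEDIUM_ERROR` is a hypothetical sentinel for a read that reports a medium error, `patrol_read_raid1` is an invented name, and each list holds what a read of that drive returns rather than the true on-disk contents.

```python
MEDIUM_ERROR = object()  # hypothetical sentinel: the read reported a medium error


def patrol_read_raid1(drive1, mirror_reads):
    """Hypothetical model: act only on reported medium errors on drive #1."""
    repaired_lbas = []
    for lba, block in enumerate(drive1):
        if block is MEDIUM_ERROR:
            drive1[lba] = mirror_reads[lba]  # recover using the mirror's data
            repaired_lbas.append(lba)
    return repaired_lbas


# Scenario 1: drive #1 returns wrong data but no medium error is reported.
drive1 = [b"A", b"?", b"C"]
mirror_reads = [b"A", b"B", b"C"]
scenario1 = patrol_read_raid1(drive1, mirror_reads)
print(scenario1, mirror_reads[1])  # nothing repaired, mirror copy untouched

# Scenario 2: drive #1 reports a medium error and the mirror read returns
# undetermined data due to the root cause.
drive1 = [b"A", MEDIUM_ERROR, b"C"]
mirror_reads = [b"A", b"?", b"C"]
scenario2 = patrol_read_raid1(drive1, mirror_reads)
print(scenario2, drive1[1])  # LBA 1 "repaired" with incorrect data
```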

Impact on a RAID-5/6:

	For RAID-5 or RAID-6, parity calculation is used to reconcile the data. This prevents these RAID levels from seeing the issue except under the following conditions: for RAID-5, two media errors must exist at the same LBA on different drives; for RAID-6, three drives must contain media errors at the same LBA.

Affected configurations

This tip is not hardware specific.

The system is configured with one or more of the following IBM Options:

ServeRAID-MR10M SAS/SATA Controller, Option 43W4339
ServeRAID-MR10i SAS/SATA Controller, Option 43W4296
ServeRAID-MR10k SAS/SATA Controller, Option 43W4280

This tip is not software specific.

Solution

The fix will be implemented in firmware version 11.0.1-0008 or newer.

The target date for this release is March 20, 2009.

Additional information

The issue is limited to the following:

Users with SATA drives.
Redundant RAID levels.
Logical drives that have undergone a Rebuild, Check Consistency, Background Initialization or Patrol Read.
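The three criteria above can be combined into a simple check. This is only an illustrative sketch: the `LogicalDrive` record, its field names, and the function `may_be_exposed` are hypothetical, and the set of redundant RAID levels is limited to the levels this tip names (1, 5, and 6).

```python
from dataclasses import dataclass, field

# Assumption: only the redundant levels explicitly named in this tip.
REDUNDANT_RAID_LEVELS = {1, 5, 6}
EXPOSING_OPERATIONS = {
    "rebuild",
    "check consistency",
    "background initialization",
    "patrol read",
}


@dataclass
class LogicalDrive:
    """Hypothetical inventory record for one logical drive."""
    raid_level: int
    drive_interface: str  # for example "SATA" or "SAS"
    operations_run: set = field(default_factory=set)


def may_be_exposed(ld):
    """Return True only when all three criteria from the tip apply."""
    return (
        ld.drive_interface.upper() == "SATA"
        and ld.raid_level in REDUNDANT_RAID_LEVELS
        and bool(ld.operations_run & EXPOSING_OPERATIONS)
    )


print(may_be_exposed(LogicalDrive(1, "SATA", {"patrol read"})))  # True
print(may_be_exposed(LogicalDrive(5, "SAS", {"rebuild"})))       # False
print(may_be_exposed(LogicalDrive(0, "SATA", {"rebuild"})))      # False
```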

For SATA drives, if a command to read 'Y' blocks of data returns with a media error at offset 'X' (X < Y), MegaRAID firmware follows this sequence:

	1a. Device layer reissues the command to read blocks '0 to X-1'.
	1b. Device layer sets the recovery and SATA media error recovery active flags for the above command.
	1c. Since the medium error is at location X, this read command returns with status OK.
	1d. As medium error recovery is active on this command, Device layer returns this command to RAID layer with Medium Error status.
	1e. RAID layer adds the medium error LBA X to the bad block list and issues a command to read blocks 'X+1 to Y-1'.
	1f. Once the read in step 1e is complete, RAID layer continues with the recovery for the bad block at X.

However, if the command in step 1a is aborted (that is, the command is not returned from the drive to the chip and is therefore terminated at the chip), the sequence changes as follows:

	2a. Device layer reissues a command to read the blocks '0 to X-1'.
	2b. Command in step 2a gets aborted.
	2c. Device layer clears the recovery flag on this command.
	2d. Device layer reissues the command to read '0 to X-1' blocks again.
	2e. Command in step 2d is completed for the RAID layer with status OK.

Because the recovery flag on the command is cleared in step 2c, the Device layer does not know that the command in step 2d was issued as part of a medium error recovery, and it returns an OK status to the RAID layer. As a result, steps 1e and 1f are skipped: the RAID layer returns undetermined data to the host and fails to puncture bad block X on the rebuilt drive.
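The two sequences can be compared in a small simulation. This is a sketch under stated assumptions, not the firmware's code: the `Command` record and the functions `device_layer_recover` and `raid_layer_read` are hypothetical names, and the status strings stand in for whatever the firmware uses internally. Step labels in the comments refer to the lists above.

```python
from dataclasses import dataclass


@dataclass
class Command:
    """Hypothetical read command covering blocks first..last."""
    first: int
    last: int
    recovery: bool = False


def device_layer_recover(x, abort_first_reissue):
    """Reissue the read of blocks 0..x-1 and report a status to the RAID layer."""
    cmd = Command(0, x - 1, recovery=True)  # steps 1a/1b (or 2a)
    if abort_first_reissue:                 # step 2b: the reissued command is aborted
        cmd.recovery = False                # step 2c: the recovery flag is cleared
        cmd = Command(cmd.first, cmd.last, cmd.recovery)  # step 2d: reissue again
    # The drive completes the read with OK because the error is at x (step 1c).
    return "MEDIUM_ERROR" if cmd.recovery else "OK"  # step 1d versus step 2e


def raid_layer_read(x, y, abort_first_reissue):
    """Return (status, bad block list, follow-up read range or None)."""
    bad_blocks = set()
    follow_up = None
    status = device_layer_recover(x, abort_first_reissue)
    if status == "MEDIUM_ERROR":
        bad_blocks.add(x)           # step 1e: record LBA x as a bad block
        follow_up = (x + 1, y - 1)  # step 1e: read the remaining blocks
        # step 1f (recovery of block x) is not modeled here
    return status, bad_blocks, follow_up


normal = raid_layer_read(5, 16, abort_first_reissue=False)
aborted = raid_layer_read(5, 16, abort_first_reissue=True)
print(normal)   # medium error reported, block 5 tracked
print(aborted)  # OK reported, block 5 never tracked
```

In the aborted path the RAID layer never learns about the medium error at X, so nothing is added to the bad block list and no recovery is attempted.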

End user impact:

Rebuild:

	MegaRAID does not puncture bad block X on the rebuilt drive, and incorrect data from the source drive is written to the rebuilt drive.

Reconstruction:

	Same exposure as in the rebuild scenario.

Document Location

Worldwide

Operating System

System x Hardware Options:Operating system independent / None
