IBM Support

Receiving a Predictive Failure Analysis (PFA) alert on a hard drive attached to the ServeRAID controller - Servers and IntelliStation

Troubleshooting


Problem

ServeRAID recovery from PFA alerts on hard drives.

Resolving The Problem

Overview

IBM eServer xSeries SCSI hard disk drives support Predictive Failure Analysis (PFA). This capability is designed into the hard disk drives so that they internally monitor themselves and diagnose a likely failure in the near future. If a disk issues a PFA event, the ServeRAID controller captures the event and forwards it locally to the ServeRAID Manager console and agent. The ServeRAID Manager and agent software support industry-standard SNMP for alert automation. If a PFA alert occurs on an IBM xSeries hard disk drive, the drive is eligible for replacement under the warranty terms and conditions.

Recovering from a hard disk drive PFA Alert on a ServeRAID controller may require multiple steps depending on the ServeRAID configuration. Ensure you have a recent backup of the data before proceeding.

If the disk reporting the PFA alert is configured in a redundant logical drive (RAID level-1, 1E, 5, 5E, 5EE, 10, 1E0, or 50), recovery is a three-step process. The first step is to prepare the hard disk drive with the active PFA alert for replacement. The second step is to replace the defunct drive. The third step is to regenerate the "Critical" logical drive back to an online or "Okay" state. If a qualified Hot Spare or Standby Hot Spare is configured, ServeRAID will automatically regenerate the "Critical" logical drive to an "Okay" state.

If the disk reporting the PFA alert is configured in a non-redundant logical drive (RAID level-0 or 00), all data on the logical drive will be lost when the disk is eventually marked defunct, so recovery is a five-step process. The first step is to back up the data on the logical drive as soon as possible. The second step is to prepare the hard disk drive with the active PFA alert for replacement. The third step is to replace the PFA drive. The fourth step is to recover the "Offline" logical drive by removing the affected "Offline" logical drive and then recreating the non-redundant logical drive. The final step is to restore the data from a recent backup.
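
The two recovery paths described above can be summarized as a small lookup. The following is only an illustrative sketch: the function name `recovery_steps` and the step wording are assumptions made for this example, not part of any ServeRAID tool. The RAID level lists are taken from the text above.

```python
from typing import List

# RAID levels as listed in this document.
REDUNDANT_LEVELS = {"1", "1E", "5", "5E", "5EE", "10", "1E0", "50"}
NON_REDUNDANT_LEVELS = {"0", "00"}


def recovery_steps(raid_level: str) -> List[str]:
    """Return the ordered recovery steps for a PFA drive (hypothetical helper)."""
    level = raid_level.strip().upper()
    if level in REDUNDANT_LEVELS:
        return [
            "Prepare the PFA drive for replacement (mark it defunct)",
            "Replace the defunct drive",
            "Regenerate the Critical logical drive to an Okay state",
        ]
    if level in NON_REDUNDANT_LEVELS:
        return [
            "Back up the data on the logical drive as soon as possible",
            "Prepare the PFA drive for replacement (mark it defunct)",
            "Replace the PFA drive",
            "Remove the Offline logical drive and recreate it",
            "Restore the data from a recent backup",
        ]
    raise ValueError(f"RAID level not covered by this document: {raid_level!r}")


if __name__ == "__main__":
    for number, step in enumerate(recovery_steps("5"), start=1):
        print(f"{number}. {step}")
```

Keeping the levels in two explicit sets means that an unlisted level raises an error instead of silently following the wrong path.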

Preparing to replace a hard drive with an active PFA Alert
  1. While the system is powered on, identify the PFA drive to be replaced in one of the following ways:

    Using ServeRAID Manager:
    1. Start the ServeRAID Manager software application locally on the system, or use ServeRAID Manager installed on a client to attach remotely to the server.
    2. Review the recent entries in the ServeRAID Event Viewer (bottom panel). Any PFA event should have a corresponding entry that identifies the PFA disk drive by Controller, Channel, and SCSI ID.
    3. Locate the physical PFA disk drive, or use ServeRAID Manager's "Identify Drive" feature. This flashes the drive LED and makes it easier to locate the physical drive in larger configurations.

    Using IPSSEND commands:
    1. Use the IPSSEND GETCONFIG command to collect the configuration logs, and manually search the output for the PFA (Yes/No) field of each disk drive. Below is an example of the output that identifies the Controller, Channel, and SCSI ID of the PFA disk drive.

      Read Configuration has been initiated for
      Controller 1
      Channel #3:
      Target on SCSI ID 3
      SCSI ID : 3 PFA (Yes/No): Yes
      State : Online (ONL)
      Size (in MB)/(in Sectors): 17357/35548048
      Device ID : IBM-ESXSST31843632813BM0KK8E

    2. Locate the physical PFA drive, or use the IPSSEND IDENTIFY command to flash the drive LED and make it easier to locate the physical drive in larger configurations.
  2. Once the physical drive is identified, back up the data on the appropriate ServeRAID logical drive.
  3. Prepare the hard disk drive for removal by marking the drive defunct. This can be done in one of the following ways:
    • Using the ServeRAID Manager application: locate the PFA disk drive, right-click the drive, and set the drive to Defunct.
    • Using the IPSSEND SETSTATE command: set the new state to DDD (Defunct Disk Drive).
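
Searching a long GETCONFIG log by hand for the PFA (Yes/No) field can be tedious, so the search could be scripted. The sketch below is a minimal example that assumes the saved output follows the same layout as the sample shown above. The function name `find_pfa_drives` and the regular expressions are assumptions made for this illustration, and real output may differ between IPSSEND versions.

```python
import re
from typing import Dict, List

# Sample text copied from the example output in this document.
SAMPLE = """\
Read Configuration has been initiated for
Controller 1
Channel #3:
Target on SCSI ID 3
SCSI ID : 3 PFA (Yes/No): Yes
State : Online (ONL)
Size (in MB)/(in Sectors): 17357/35548048
Device ID : IBM-ESXSST31843632813BM0KK8E
"""


def find_pfa_drives(getconfig_text: str) -> List[Dict[str, int]]:
    """Return controller, channel, and SCSI ID for each drive with PFA set to Yes."""
    controller = None
    channel = None
    flagged = []
    for line in getconfig_text.splitlines():
        match = re.search(r"Controller\s+(\d+)", line)
        if match:
            controller = int(match.group(1))
            continue
        match = re.search(r"Channel\s+#(\d+)", line)
        if match:
            channel = int(match.group(1))
            continue
        # Assumed field layout: "SCSI ID : <n> PFA (Yes/No): <Yes|No>"
        match = re.search(r"SCSI ID\s*:\s*(\d+)\s+PFA \(Yes/No\)\s*:\s*(Yes|No)", line)
        if match and match.group(2) == "Yes":
            flagged.append(
                {"controller": controller, "channel": channel, "scsi_id": int(match.group(1))}
            )
    return flagged


if __name__ == "__main__":
    print(find_pfa_drives(SAMPLE))
    # prints [{'controller': 1, 'channel': 3, 'scsi_id': 3}]
```

The parser remembers the most recent Controller and Channel lines, so each flagged drive is reported with its full address.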
Replacing a defunct hard disk drive
Replacing a defunct hard disk drive attached to a Hot-Swap Backplane (internal backplane or EXP Storage Enclosure)
  1. While the system is powered on, identify the defunct physical drive to be replaced in one of the following ways:
    • A solid amber drive LED indicator will be on.
    • Using ServeRAID Manager, right-click the defunct drive and choose "Identify Drive". This flashes the drive LED and makes it easier to locate the physical drive in larger configurations.
  2. Without removing the drive completely, use the handle of the hot-swap tray to gently disengage the physical drive from the hot-swap backplane.
  3. Wait 45 seconds to allow the hard disk drive motor to completely stop spinning.
  4. Remove the defunct drive and insert a replacement hard disk drive of the same or larger capacity, using the handle of the hot-swap tray to secure the new drive in place against the hot-swap backplane.
  5. Within a few minutes, ServeRAID should detect the hot-swap event and, depending on the configuration, take one of the following actions:
    • If no Hot Spare or Standby Hot Spare was previously defined, ServeRAID should automatically initiate a rebuild to the replacement disk.
    • If a Hot Spare or Standby Hot Spare was previously defined, ServeRAID should automatically set the state of the replacement disk to a Hot Spare or Standby Hot Spare respectively.
    • If the hot-swap rebuild BIOS setting has been disabled, you will need to manually rebuild the hard disk drive.
  6. Observe the drive for normal device startup behavior and LED activity:
    • The replacement hard disk will spin up
    • The activity LED (green drive LED) should begin flashing and may eventually turn off or become very busy depending on the configuration.
    • The solid Amber drive LED indicator should turn off

Note: These state transitions can be observed in real time using ServeRAID Manager.

Optional: These same instructions can be used for an Offline hard disk drive replacement by booting to the IBM ServeRAID Support CD, if desired.

Replacing a hard disk drive attached to a standard SCSI cable
  1. Determine the Channel and SCSI ID of the defunct disk drive using ServeRAID Manager or the IPSSEND GETCONFIG command.
    Note:
    If the operating system cannot be booted to run these tools, boot the IBM ServeRAID Support CD to determine the Channel and SCSI ID, or boot the Command Line diskette to run the IPSSEND GETCONFIG command.
  2. Power the ServeRAID system off, and open the system chassis.
  3. Locate the defunct hard disk drive by checking the physically attached channel and the SCSI ID jumper settings for each drive.
    Note:
    SCSI ID jumper settings are usually labeled on the drive.
  4. Disconnect the SCSI cable and power cable from the defunct drive, and remove the drive from the system.
  5. Configure the SCSI ID jumper settings for the replacement disk drive to the same SCSI ID as the defunct drive.
  6. Reconnect the SCSI cable and power cable.
  7. Review Best Practices for hard drives attached to SCSI cables.
  8. Close the system chassis.
  9. Power on the ServeRAID system.
  10. Verify the ServeRAID POST Banner properly detects the replaced drive.
  11. Within a few minutes, ServeRAID should detect the cold-swap event and, depending on the configuration, take one of the following actions:
    • If no Hot Spare or Standby Hot Spare was previously defined, ServeRAID should automatically initiate a rebuild to the replacement disk.
    • If a Hot Spare or Standby Hot Spare was previously defined, ServeRAID should automatically set the state of the replacement disk to a Hot Spare or Standby Hot Spare respectively.
    • If the hot-swap rebuild BIOS setting has been disabled, you will need to manually rebuild the hard disk drive.
Best Practices for Hard Drives attached to SCSI cables
  • Use internal SCSI cables with embedded terminators at the end of the cable, whenever possible.
  • Attach devices to the cable starting with the connector closest to the SCSI terminator (end of the cable) and work your way forward to the connectors closest to the controller.
  • Each device attached to the SCSI cable must have a unique SCSI ID setting.
  • The last device on a SCSI cable must terminate the SCSI bus. If there is an embedded terminator on the cable, ensure all other attached devices are NOT configured to provide termination. If the cable is not terminated, the device attached to the end of the cable must be jumpered/configured to provide termination.
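
The SCSI ID and termination rules above can be expressed as a quick sanity check before powering the system back on. This is only a sketch under stated assumptions: the `ScsiDevice` type and the `check_scsi_chain` function are hypothetical names created for this example, and the device list is assumed to be ordered from the connector nearest the controller to the end of the cable.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ScsiDevice:
    scsi_id: int
    terminates: bool  # True if the device is jumpered/configured to provide termination


def check_scsi_chain(devices: List[ScsiDevice], cable_has_terminator: bool) -> List[str]:
    """Return a list of rule violations; an empty list means no problem was found."""
    problems = []

    # Rule: each device on the cable must have a unique SCSI ID.
    ids = [device.scsi_id for device in devices]
    duplicates = sorted({i for i in ids if ids.count(i) > 1})
    if duplicates:
        problems.append(f"duplicate SCSI IDs: {duplicates}")

    if cable_has_terminator:
        # Rule: with an embedded terminator, no device may provide termination.
        for device in devices:
            if device.terminates:
                problems.append(f"SCSI ID {device.scsi_id} must not provide termination")
    else:
        # Rule: without a cable terminator, only the last device terminates the bus.
        if not devices or not devices[-1].terminates:
            problems.append("the last device on the cable must provide termination")
        for device in devices[:-1]:
            if device.terminates:
                problems.append(f"SCSI ID {device.scsi_id} must not provide termination")

    return problems


if __name__ == "__main__":
    chain = [ScsiDevice(0, False), ScsiDevice(1, False), ScsiDevice(1, False)]
    for problem in check_scsi_chain(chain, cable_has_terminator=False):
        print(problem)
```

The check does not model the controller's own SCSI ID, which also has to differ from every device ID on the channel.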
Initiating a Rebuild using the ServeRAID Manager program or the IBM ServeRAID Support CD

Use ServeRAID Manager to manually initiate a rebuild operation or to change the state of the drive to a hot spare. To do this, right-click the defunct drive that was recently replaced, then select the rebuild operation or change the state of the disk, as appropriate.

Initiating a Rebuild using IPSSEND

The same operations can be initiated with the IPSSEND command-line tool. The IPSSEND REBUILD command forces a rebuild operation to start, and the IPSSEND SETSTATE command changes the state of the disk. Run these commands without switches to see their complete syntax. This utility is located on the IBM ServeRAID Support CD.

Note: Hot spare restriction - a hot spare (HSP) cannot be used if its capacity is smaller than that of the failed disk drive.
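
The capacity restriction amounts to a simple comparison, which also applies to the "same size (or larger)" requirement for a replacement drive. The sketch below is illustrative only: `fits_as_replacement` is a hypothetical helper, the 17357 MB figure comes from the sample GETCONFIG output in this document, and the 8678 MB figure is an invented smaller drive.

```python
def fits_as_replacement(candidate_mb: int, failed_mb: int) -> bool:
    """Return True when the candidate drive is at least as large as the failed drive."""
    return candidate_mb >= failed_mb


if __name__ == "__main__":
    failed_mb = 17357  # size reported for the PFA drive in the sample output
    print(fits_as_replacement(17357, failed_mb))  # prints True
    print(fits_as_replacement(8678, failed_mb))   # prints False (hypothetical smaller drive)
```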

ServeRAID Manager screen during a rebuild


Document Location

Worldwide

Operating System

IntelliStation Pro:All operating systems listed

System x:All operating systems listed

Older System x:All operating systems listed


Document Information

Modified date:
29 January 2019

UID

ibm1MIGR-58230