
Repeating SRC B6008000 or SRC B6008001 (CPPEA33) and SRC B6008003 (CPPEA35) can cause performance issues

Troubleshooting


Problem

In multipath environments, IBM i performance issues can occur when a connection to a path is lost. The performance issue can happen even if the repeated PAL entries are on a single port. This is due to the way IBM i handles path recovery when a connection error repeatedly fails and recovers.

Symptom

IBM i jobs may be delayed waiting on long I/Os, and users might report performance issues when several B600800* PAL entries and CPPEA* messages are present. Even when the average disk service time is acceptable, IBM i Collection Services data can show that some writes take more than 1 second at the same time as a spike in jobs waiting on I/O. Collection Services data can also show that the poor disk performance is isolated to one path (one virtual Fibre Channel 6B25 IOA adapter).
B6008000 or B6008001
PAL entries indicate a storage subsystem disk unit connection has failed.  Look for other errors in the Service Action Log (SAL) or Product Activity Log (PAL). 
msgCPPEA33 Warning - A storage subsystem disk unit connection has failed.  A connection from I/O adapter &27 to storage subsystem disk unit &28 has failed. There are still &43 active connections to this disk unit.  Recovery: Look for other errors related to this problem and report them to your hardware service provider.
B6008003
PAL entry indicates a storage subsystem disk unit connection was restored.
msgCPPEA35 - Informational only.  A connection to a storage subsystem disk unit has been restored. A connection from I/O adapter &27 to storage subsystem disk unit &28 has been restored.  There are now &43 active connections to this disk unit. Recovery:  No action required.     
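To gauge how often the connection is failing and recovering, the messages can be counted with SQL. The following is a sketch that assumes IBM i 7.3 or later (where the QSYS2.HISTORY_LOG_INFO table function is available) and that the CPPEA33/CPPEA35 messages are being logged to the history log on your system:
    SELECT MESSAGE_ID, COUNT(*) AS OCCURRENCES,
           MIN(MESSAGE_TIMESTAMP) AS FIRST_SEEN,
           MAX(MESSAGE_TIMESTAMP) AS LAST_SEEN
      FROM TABLE(QSYS2.HISTORY_LOG_INFO(START_TIME => CURRENT TIMESTAMP - 24 HOURS))
     WHERE MESSAGE_ID IN ('CPPEA33', 'CPPEA35')
     GROUP BY MESSAGE_ID
A high and roughly equal count of CPPEA33 and CPPEA35 messages over a short window indicates the repeated fail/recover pattern described in this document.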

Cause

A high volume of PAL entries and messages indicates that a connection is repeatedly failing and recovering.  Each recovery causes a delay (2-3 seconds for DS8K, 30-120 seconds for SVC/Storwize), and the accumulation of these repeated recoveries is what leads to extended I/O performance problems.  In comparison, this performance issue does not occur on AIX or Linux because they FAST FAIL their adapters.  IBM i does not FAST FAIL its adapters because, in every other situation, it is best to recover the connection if possible.

Environment

NPIV Connected External Storage
Virtual IOA Type 6B25
Virtual Fibre Channel (VFC)
Native IOAs (such as IOA Type 577D or 577F)

Diagnosing The Problem

To find the root cause, IBM Support cases should be opened with the Power server hardware team, the PowerVM VIOS team (if VIOS owns the adapters), and the appropriate external storage team (see www.ibm.com/mysupport).
The connection error can be anywhere in the client's Fibre Channel fabric topology, so a case should be opened with both the storage team and the Power server hardware (or PowerVM VIOS) team.  Provide data logs for the hardware and storage teams to analyze to help identify the failing hardware path components for the client and SSR to fix.  The client's SAN support team or vendor (IBM, Brocade, Cisco, Dell EMC, etc.) should be engaged to assist with problem determination.

Resolving The Problem

The temporary circumvention is to disable the port with the repeated errors until the root cause is identified and fixed.
Disabling the port with the known issue should restore performance within a few minutes.  Verify first that there is at least one other active path.  If another path is active, then un-map the virtual adapter through VIOS, or disable the port on the switch (note that all of the connected partitions will lose that path, so this may not be desirable).  If VIOS is not involved, then you can reset the IOP or power off the domain through DST.  Once the root cause is resolved, the port can be enabled again.
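As an illustration of un-mapping the virtual adapter through VIOS, the following is a sketch using the standard VIOS padmin commands; the vfchostN and fcsX names are placeholders and must be replaced with the devices shown for your mappings:
    $ lsmap -all -npiv
    (find the vfchost device that serves the affected client virtual Fibre Channel adapter)
    $ vfcmap -vadapter vfchostN -fcp
    (running vfcmap with no FC port name removes the mapping, which takes down that path)
    $ vfcmap -vadapter vfchostN -fcp fcsX
    (once the root cause is fixed, re-map the vfchost to the physical port to restore the path)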
Disable/Enable a path through SQL (requires IBM i 7.4 TR7, or IBM i 7.5 TR1 and above):
*Note: Depending on your storage type, disabling the port and the data rerouting recovery may slow or halt the system for roughly 30 seconds to 2 minutes per disabled port.
    CHANGE_DISK_PATHS procedure
    https://www.ibm.com/docs/en/i/7.5?topic=services-change-disk-paths-procedure
    Examples of SQL commands to disable or enable a path:
    Disable a path by storage WWPN:
    CALL QSYS2.CHANGE_DISK_PATHS(OPERATION     => 'DISABLE',
                                 REMOTE_WWPN   => '5005xxxxxxxxxxxx')
    Enable a path by storage WWPN:
    CALL QSYS2.CHANGE_DISK_PATHS(OPERATION     => 'ENABLE',
                                 REMOTE_WWPN   => '5005xxxxxxxxxxxx')
    Disable a path by local host (IBM i) WWPN:
    CALL QSYS2.CHANGE_DISK_PATHS(OPERATION     => 'DISABLE',
                                 HOST_WWPN     => 'c050xxxxxxxxxxxx')
    Enable a path by local host (IBM i) WWPN:
    CALL QSYS2.CHANGE_DISK_PATHS(OPERATION     => 'ENABLE',
                                 HOST_WWPN     => 'c050xxxxxxxxxxxx')
    Disable a path by IBM i adapter:
    CALL QSYS2.CHANGE_DISK_PATHS(OPERATION     => 'DISABLE',
                                 ADAPTER_NAME  => 'DCxx')
    Enable a path by IBM i adapter:
    CALL QSYS2.CHANGE_DISK_PATHS(OPERATION     => 'ENABLE',
                                 ADAPTER_NAME  => 'DCxx')
    Disable a path by specific adapter port using remote WWPN:
    CALL QSYS2.CHANGE_DISK_PATHS(OPERATION     => 'DISABLE',
                                 REMOTE_WWPN   => '5005xxxxxxxxxxxx',
                                 ADAPTER_NAME  => 'DCxx')
    Enable a path by specific adapter port using remote WWPN:
    CALL QSYS2.CHANGE_DISK_PATHS(OPERATION     => 'ENABLE',
                                 REMOTE_WWPN   => '5005xxxxxxxxxxxx',
                                 ADAPTER_NAME  => 'DCxx')
If your operating system level is not high enough to disable a single path, or the issue is with the adapter itself, then you can use the following steps to reset (disable) or IPL (reset and then enable) the IOP adapter.
This assumes that there is at least one other active path to the disk units.
    1. STRSST and sign in
    2. Option 1. Start a service tool
    3. Option 7. Hardware service manager
    4. Option 2.  Logical hardware resources (buses, IOPs, controllers,...)
    5. Option 1. System bus resources
    6. On the desired adapter, select option 6=I/O debug
    7. Select option 3. Reset I/O processor.
    *Note: Reset will 'Disable' the IOP, and the IOP needs to be IPLed to enable it again. Disabling the port and the data rerouting recovery may slow or halt the system for roughly 30 seconds to 2 minutes per disabled port.
    *Note: Disk unit resources may take up to several minutes to start checking in. You can also attempt option 4, IPL I/O processor, which is similar to an IPL but just for the card.
Note:  If the Power server adapter firmware is older than February 2020, then the issue is likely due to HW/SAN issues (such as SFP low RX/TX power) that are not handled properly by the older adapter firmware.  An update to the adapter firmware may circumvent the performance issue, but the root cause will still need to be addressed. If VIOS is involved, make sure level 3.1.2.30 or later is being used to avoid known NPIV and path issues (see VIOS Hiper Fixes).
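To confirm the VIOS level, the ioslevel command can be run from the VIOS padmin command line:
    $ ioslevel
If the reported level is below 3.1.2.30, plan a VIOS update before pursuing further path diagnostics.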

Document Location

Worldwide


Document Information

Modified date:
17 April 2024

UID

ibm16492989