IBM Support

What Causes a DMA_ERR on an AIX Host?

Question & Answer


Question

What causes a DMA_ERR to be logged in the AIX error report (errpt)?

Sample error details:

LABEL:		DMA_ERR
IDENTIFIER:	02D12055

Date/Time:       Tue Nov  6 15:34:14 2018
Sequence Number: 277275
Machine Id:      <Machine_ID>
Node Id:         <AIX_system_name>
Class:           S
Type:            UNKN
WPAR:            Global
Resource Name:   VIODMA

Description
UNDETERMINED ERROR

Probable Causes
SYSTEM I/O BUS
SOFTWARE PROGRAM
ADAPTER
DEVICE

	Recommended Actions
	PERFORM PROBLEM DETERMINATION PROCEDURES

Detail Data
BUS ID
9000 0340                                                                        [...@                            ]
EVENT CODE
5000 0000                                                                        [P...                            ]
LINK REGISTER
F100 0000 C011 5E60                                                              [......^`                        ]
SOURCE LOCATION
line: 2521 file: d_map_util.c
PROBLEM DATA
0000 0000 1000 0067 0000 0000 0000 4000 F100 0A0F 501E 1200 0000 0000 0000 0000  [.......g......@.....P...........]
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  [................................]

Cause

This technote lists some of the most common probable causes.

Answer

Probable Cause #1

The max_xfer_size value of a virtual Fibre Channel adapter was increased from the default value.

Determine if the AIX host has a virtual Fibre Channel adapter configured with a non-default max_xfer_size attribute.
Below is an example of the default value for fcs0:

[Figure: default max_xfer_size value for fcs0]

If max_xfer_size was increased from the default (0x100000), change the attribute back to the default value and monitor the AIX error report to see whether the error recurs.

For more details, see AIX/VIOS Disk and Adapter IO Queue Tuning.
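
As an illustration of the check described above, the following Python sketch parses the text printed by lsattr -El fcs0 -a max_xfer_size and compares the value with the default quoted in this technote (0x100000). The sample output line and the helper names are assumptions made for this example; they are not output captured from a real host.

```python
# Default max_xfer_size as stated in this technote.
DEFAULT_MAX_XFER_SIZE = 0x100000

# Assumed shape of one lsattr output line:
#   attribute  value  description  user_settable
SAMPLE_LSATTR_LINE = "max_xfer_size 0x100000 Maximum Transfer Size True"


def parse_max_xfer_size(lsattr_output: str) -> int:
    """Return the max_xfer_size value found in lsattr output text."""
    for line in lsattr_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "max_xfer_size":
            # int() with base 16 accepts the leading "0x" prefix.
            return int(fields[1], 16)
    raise ValueError("max_xfer_size attribute not found")


def is_default(lsattr_output: str) -> bool:
    """True when the adapter still uses the default transfer size."""
    return parse_max_xfer_size(lsattr_output) == DEFAULT_MAX_XFER_SIZE


if __name__ == "__main__":
    value = parse_max_xfer_size(SAMPLE_LSATTR_LINE)
    state = "default" if is_default(SAMPLE_LSATTR_LINE) else "non-default"
    print(f"max_xfer_size = {value:#x} ({state})")
```

If the value is not the default, it can typically be reset with chdev -l fcs0 -a max_xfer_size=0x100000 -P, which defers the change until the adapter is next reconfigured or the system is rebooted. Verify the procedure for your environment before applying it.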


Probable Cause #2

The error was logged for an invalid device.
In the following sample error, the PROBLEM DATA refers to the device name ent5, which was no longer configured on the AIX host. In this scenario, ent5 was an old EtherChannel adapter that had been configured and later removed in order to perform a Live Partition Mobility operation.

LABEL:           DMA_ERR
IDENTIFIER:      C258BB1A
Date/Time:       Tue Aug 11 18:17:35 2020
Sequence Number: 1467030
Machine Id:      <Machine_ID>
Node Id:         <AIX_system_name>
Class:           S
Type:            INFO
WPAR:            Global
Resource Name:   VIODMA
Description
UNDETERMINED ERROR
Probable Causes
SYSTEM I/O BUS
SOFTWARE PROGRAM
ADAPTER
DEVICE
    Recommended Actions
    PERFORM PROBLEM DETERMINATION PROCEDURES
Detail Data
BUS ID
9000 0340                                                [...@                            ]
EVENT CODE
1000 0004                                                [....                            ]
LINK REGISTER
0000 0000 0000 0000                                      [........                        ]
SOURCE LOCATION
line: 3035 file: vdev_bus.c
PROBLEM DATA
0000 0000 0000 1801 0000 0000 1000 0008 656E 7435 2E70 6879 706D 656D 0000 0000  [................ent5.phypmem....]
0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000  [................................]

ent0   Available       U9009.42A.786C5E0-V50-C2-T1  Virtual I/O Ethernet Adapter (l-lan)
ent1   Available       U9009.42A.786C5E0-V50-C4-T1  Virtual I/O Ethernet Adapter (l-lan)
ent2   Available       U9009.42A.786C5E0-V50-C3-T1  Virtual I/O Ethernet Adapter (l-lan)
ent6   Available                                    EtherChannel / IEEE 802.3ad Link Aggregation
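
The device name in this example is visible only in the ASCII rendering of the PROBLEM DATA. The following Python sketch shows one way to decode such a hex dump and pull out printable strings. The function names are hypothetical. Treating the text before the first dot as the device name follows this example only and is not a documented format.

```python
import re

# First PROBLEM DATA row from the sample error above.
PROBLEM_DATA = (
    "0000 0000 0000 1801 0000 0000 1000 0008 "
    "656E 7435 2E70 6879 706D 656D 0000 0000"
)


def to_bytes(hex_dump: str) -> bytes:
    """Convert a whitespace-separated hex dump into raw bytes."""
    return bytes.fromhex("".join(hex_dump.split()))


def ascii_view(hex_dump: str) -> str:
    """Render the dump the way errpt does: non-printable bytes become '.'."""
    return "".join(chr(b) if 32 <= b < 127 else "." for b in to_bytes(hex_dump))


def printable_strings(hex_dump: str, min_len: int = 4) -> list:
    """Return runs of at least min_len printable ASCII characters."""
    pattern = re.compile(rb"[ -~]{" + str(min_len).encode() + rb",}")
    return [m.decode("ascii") for m in pattern.findall(to_bytes(hex_dump))]


if __name__ == "__main__":
    print(ascii_view(PROBLEM_DATA))         # ................ent5.phypmem....
    print(printable_strings(PROBLEM_DATA))  # ['ent5.phypmem']

    # Compare the embedded name with the adapters listed on the host.
    configured = {"ent0", "ent1", "ent2", "ent6"}
    name = printable_strings(PROBLEM_DATA)[0].split(".")[0]
    print(name, "configured" if name in configured else "not configured")
```

A name that appears in the PROBLEM DATA but not in the current device listing, as with ent5 here, points to an error logged against a device that has since been removed.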

Probable Cause #3

A DMA_ERR was logged on both the PowerVM Virtual I/O Server (VIOS) and the AIX NPIV client at around the same time.
Use the summ diagnostic tool to parse and decode the VIOS and AIX error logs. On the VIOS, run the tool from the oem_setup_env shell.
 
Virtual I/O Server:
    Jan 21 20:03:51 PCIDMA     I DMA_ERR             90000383 40000003 00000000074D7724 [line: 2513 file: d_map.c]
AIX client errors (from latest to oldest):
   Jan 21 20:26:42 VIODMA     I DMA_ERR             90000340 20000001 F1000000C011852C [line: 3799 file: vdev_bus.c]
   Jan 21 20:26:42 VIODMA     I DMA_ERR             90000340 20000000 F1000000C011852C [line: 3757 file: vdev_bus.c]
   Jan 21 20:26:42 VIODMA     I DMA_ERR             90000340 20000001 F1000000C010774C [line: 3799 file: vdev_bus.c]
   Jan 21 20:26:42 VIODMA     I DMA_ERR             90000340 20000000 F1000000C010774C [line: 3757 file: vdev_bus.c]
   Jan 21 20:26:41 fscsi3     I FCP_ERR10           Unknown error number: 00000401
   Jan 21 20:26:41 fscsi3     I FCP_ERR10           Driver is closing with unlogged errors for target port 0x021101
   Jan 21 20:04:38 hdisk23    P SC_DISK_ERR7   path  6 path failure; INQUIRY          adapter HW failure
   ...
   Jan 21 20:04:37 hdisk1     P SC_DISK_ERR7   path  7 path failure; INQUIRY          adapter HW failure
   Jan 21 20:03:52 hdisk7     P SC_DISK_ERR7   path  7 path failure; WRITE(10)        (04A73600,0008) adapter HW failure
   Jan 21 20:03:52 hdisk0     P SC_DISK_ERR7   path  7 path failure; WRITE(10)        (27BA31F0,0008) adapter HW failure
   Jan 21 20:03:51 hdisk14    P SC_DISK_ERR7   path  6 path failure; WRITE(16)        adapter HW failure
   Jan 21 20:03:51 hdisk12    P SC_DISK_ERR7   path 14 path failure; WRITE(10)        (A7C16800,0040) transport dead
   Jan 21 20:03:51 hdisk9     P SC_DISK_ERR7   path 15 path failure; WRITE(16)        adapter HW failure
   Jan 21 20:03:51 fcs3       I VFC_ERR4       VIOS vios2, vfc_host vfchost67; loc 00F5; Informational message
   Jan 21 20:03:51 hdisk3     P SC_DISK_ERR7   path  7 path failure; WRITE(16)        adapter HW failure
   Jan 21 20:03:51 fcs3       I VFC_ERR4       VIOS vios2, vfc_host vfchost67; loc 00F5; Informational message
   Jan 21 20:03:51 hdisk2     P SC_DISK_ERR7   path 15 path failure; WRITE(10)        (A7DA9240,0040) adapter HW failure
   Jan 21 20:03:51 fcs3       I VFC_ERR4       VIOS vios2, vfc_host vfchost67; loc 00F5; Informational message
   Jan 21 20:03:51 hdisk13    P SC_DISK_ERR7   path  7 path failure; WRITE(16)        adapter HW failure
   Jan 21 20:03:51 fcs3       I VFC_ERR4       VIOS vios2, vfc_host vfchost67; loc 00F5; Informational message
   Jan 21 20:03:51 fscsi3     P FCP_ERR1       Adapter driver rejected I/O request for port 0x024701; ENOPROTOOPT
   Jan 21 20:03:51 fcs3       T VFC_ERR5       VIOS vios2, vfc_host vfchost67; loc 0080; Unexpected error returned from calls to PHYP
The AIX error log events start with VFC_ERR5, which is the key to this issue. This error can be logged when the client device driver receives a response that might no longer be active; the VFC client then assumes an error and puts the adapter into an error state. (This is by design.)
The VFC_ERR4 errors and the path errors are consequences of that preceding error.
Once ap->state is set to VFC_ERROR, the only recovery is to remove (rmdev) and reconfigure the adapter; this appears to have happened at around 20:26 the same day, which caused the informational FCP_ERR10 errors.
In this example, the root cause of the VFC_ERR5 could not be determined because recovering the adapter state caused the component trace data to be overwritten.
This problem in turn invalidates the DMA space used to exchange data between the client and the server. That explains the DMA_ERR on the VIOS, logged when the connection was lost while I/O was still to be transferred to the client, and the VIODMA DMA_ERR errors on the client, logged on reconfiguration.
System firmware or HBA firmware issues may contribute to the problem. Ensure that firmware is up to date.
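
Because the client listing is printed from latest to oldest, the triggering event is at the bottom. The Python sketch below reorders such a listing chronologically so that the first event stands out. The column layout (month, day, time, resource, type, label, detail) is inferred from the sample lines in this technote and is an assumption, as are the function names. The listing carries no year, so the caller supplies one; the default is only a placeholder.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional

MONTHS = {name: num for num, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# A few lines copied from the client listing above (latest first).
SAMPLE = """\
   Jan 21 20:26:42 VIODMA     I DMA_ERR             90000340 20000001 F1000000C011852C [line: 3799 file: vdev_bus.c]
   Jan 21 20:26:41 fscsi3     I FCP_ERR10           Unknown error number: 00000401
   ...
   Jan 21 20:03:51 fcs3       I VFC_ERR4       VIOS vios2, vfc_host vfchost67; loc 00F5; Informational message
   Jan 21 20:03:51 fscsi3     P FCP_ERR1       Adapter driver rejected I/O request for port 0x024701; ENOPROTOOPT
   Jan 21 20:03:51 fcs3       T VFC_ERR5       VIOS vios2, vfc_host vfchost67; loc 0080; Unexpected error returned from calls to PHYP
"""


class Event(NamedTuple):
    when: datetime
    resource: str
    err_type: str
    label: str
    detail: str


def parse_line(line: str, year: int = 2000) -> Optional[Event]:
    """Parse one summary line; return None for lines that do not match."""
    parts = line.strip().split(None, 6)
    if len(parts) < 6 or parts[0] not in MONTHS:
        return None
    month, day, clock, resource, err_type, label = parts[:6]
    hour, minute, second = (int(x) for x in clock.split(":"))
    when = datetime(year, MONTHS[month], int(day), hour, minute, second)
    detail = parts[6] if len(parts) > 6 else ""
    return Event(when, resource, err_type, label, detail)


def chronological(listing: str, year: int = 2000) -> List[Event]:
    """Return events oldest first, given a latest-to-oldest listing."""
    events = [e for e in (parse_line(l, year) for l in listing.splitlines()) if e]
    # Reverse before the stable sort so that events sharing the same
    # second keep their original oldest-first order.
    return sorted(reversed(events), key=lambda e: e.when)


if __name__ == "__main__":
    for ev in chronological(SAMPLE):
        print(ev.when.strftime("%H:%M:%S"), ev.resource, ev.label)
```

With the sample lines, the first event in the reordered timeline is the VFC_ERR5 on fcs3, which matches the analysis above.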


Document Information

Modified date:
20 October 2021

UID

ibm10739749