IBM Support

MustGather Testcase for VIOS Crash or VIOS Hang Condition

Troubleshooting


Problem

A PowerVM Virtual I/O Server (VIOS) crashed or hung, and the root cause needs to be determined.

Symptom

VIOS hung (lost console and network access) or experienced a system crash.

Cause

Different factors can lead a VIOS partition to crash or hang. These include, but are not limited to, a code defect or a hardware malfunction. VIOS memory resources play an important role in VIOS availability. More often than not, VIOS crashes are due to insufficient memory resources on the VIOS partition.
It is strongly recommended to read this document in its entirety before taking any action, so that you can determine which scenario best applies to your situation.

Environment

VIOS version 3.1

Diagnosing The Problem

In order to determine what happened, system dump data must be collected immediately after the system crash or right after forcing a system dump in cases where the VIOS partition is determined to be in a hang condition.
To determine if there is a good system dump, log in as padmin and run:
$ oem_setup_env
# sysdumpdev -L
0453-039
Device name: /dev/lg_dumplv
Major device number: 10
Minor device number: 11
Size: 21010432 bytes
Date/Time: Tue Feb 12 13:05:33 CST 2015
Dump status: 0                     <-----
dump completed successfully        <-----
Common dump status codes include (but are not limited to):
 0  Successful system dump
-2  Dump device is too small (partial dump may be available)
-3  Dump facility has failed (dump crashed or did not start)
-4  Dump failed due to I/O error (bad disk or disk adapter)
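The status check above can be scripted. The following is a minimal sketch, assuming the 'sysdumpdev -L' output has the same layout as the sample shown in this document. The function names dump_status_from_output and describe_dump_status are hypothetical helpers, not VIOS or AIX commands.

```shell
# Hypothetical helpers (not part of VIOS or AIX); a sketch only.

# Read 'sysdumpdev -L' output on stdin and print the numeric dump status.
# Assumes a line of the form "Dump status: <number>", as in the sample output.
dump_status_from_output() {
    sed -n 's/^Dump status:[[:space:]]*\(-\{0,1\}[0-9][0-9]*\).*$/\1/p' | head -n 1
}

# Map a dump status code to the description listed in this document.
describe_dump_status() {
    case "$1" in
        0)  echo "Successful system dump" ;;
        -2) echo "Dump device is too small (partial dump may be available)" ;;
        -3) echo "Dump facility has failed (dump crashed or did not start)" ;;
        -4) echo "Dump failed due to I/O error (bad disk or disk adapter)" ;;
        *)  echo "Status not listed in this document: $1" ;;
    esac
}
```

From the root shell (after oem_setup_env), one possible use is to run status=$(sysdumpdev -L 2>&1 | dump_status_from_output) and then describe_dump_status "$status". The 2>&1 redirection is a precaution in case part of the output is written to standard error.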

Resolving The Problem

First, determine if the VIOS crashed or hung.
VIOS System Crash Condition
A system-generated crash can be identified by a dump progress indicator of 0c9. The VIOS typically boots back up automatically after the dump completes. If the VIOS partition booted successfully after the system crash and 'sysdumpdev -L' reports "Dump status: 0", gather VIOS snap data and contact your local IBM SupportLine Representative to have the dump analyzed. If the dump status is a nonzero value, see the section "VIOS Crashed or Hung and there is no successful dump" at the end of this document.
VIOS Hang Condition
 
Note: A VIOS partition is considered to be in a hung condition if the partition was in a Running state and suddenly lost both network and console access.
If the VIOS does not have network connectivity (for example, telnet or ssh), check for console access to verify whether the VIOS is "hung".
For HMC-managed VIOS, this can be done via HMC GUI:
Select the VIOS partition
> Click Console Window
> Click Open Terminal Window (or log in to the HMC command prompt as hscroot > type vtmenu > enter the number of the managed system > enter the number of the VIOS partition in question).

If the VIOS has no network connectivity but is accessible via the console, the partition is not in a hung condition, and the issue may be network-related. Networking is beyond the scope of this document. For assistance troubleshooting VIOS/AIX networking issues, contact your network administrator to determine whether they can account for the network outage. If your network administrator identifies no issue that explains the loss of network connectivity to the VIOS, contact your local IBM SupportLine Representative and request to speak to the networking team for investigation.

If the VIOS partition is determined to be hung (no network connectivity and no console access) as previously noted, and root cause analysis is needed, a system memory dump must be forced while the partition is in the hung state.
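The hang determination described above reduces to two observations: whether the VIOS is reachable over the network, and whether it is reachable via the console. A minimal sketch of that decision follows. The function name classify_vios_state and its yes/no arguments are hypothetical, and the operator is assumed to have made both checks manually.

```shell
# Hypothetical helper (not a VIOS or HMC command); a sketch only.
# Usage: classify_vios_state NETWORK CONSOLE
# Each argument is "yes" or "no", as observed by the operator.
classify_vios_state() {
    network=$1
    console=$2
    if [ "$network" = "no" ] && [ "$console" = "no" ]; then
        # Matches this document's definition of a hung partition.
        echo "hung: a forced system dump is needed for root cause analysis"
    elif [ "$network" = "no" ] && [ "$console" = "yes" ]; then
        echo "not hung: possible network-related problem"
    else
        echo "not hung: network access is available"
    fi
}
```

For example, classify_vios_state no yes reports a possible network-related problem. A "hung" result is only a classification; the IMPORTANT notes in this document still apply before any dump is forced.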

IMPORTANT

  1. Forcing an OS system dump on the VIOS partition will cause the VIOS to be rebooted. Therefore, ensure all client partitions that are served by that VIOS are fully redundant for network and disk I/O through a second VIOS before initiating the OS system dump.
  2. The forced-dump procedure does not apply to a VIOS partition that is part of an SSP (Shared Storage Pool) cluster environment. If the hung VIOS is part of an SSP cluster, do NOT force a dump. Contact your local IBM SupportLine Representative to discuss.

To force a system dump on the VIOS, refer to the following technote:

How to force a system dump
> Section LPAR using HMC > HMC GUI

After forcing the system memory dump, check that the VIOS partition boots up to the padmin login prompt, and verify that the 'sysdumpdev -L' output reports the dump completed successfully. If there is a good dump (Dump status: 0), proceed to capture VIOS snap data. Then contact your local IBM SupportLine Representative to have the dump analyzed.

VIOS Crashed or Hung and there is no successful dump

If 'sysdumpdev -L' does not report a successful dump, root cause analysis is not possible due to insufficient data (no good dump).
In some cases, there may be a partial dump (dump status: -2) or a minidump that might allow for an educated guess. If a minidump was captured, it will be reflected in the error log. To review the error log, log in as padmin and run:
$ errlog
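The follow-up actions in this document depend on the dump status value. A minimal sketch of that mapping follows, assuming the status has already been read from the 'sysdumpdev -L' output. The function name next_step_for_dump_status is a hypothetical helper, and the messages paraphrase this document rather than any command output.

```shell
# Hypothetical helper (not a VIOS or AIX command); a sketch only.
# Usage: next_step_for_dump_status STATUS
# STATUS is the number shown after "Dump status:" in 'sysdumpdev -L' output.
next_step_for_dump_status() {
    case "$1" in
        0)
            echo "good dump: capture VIOS snap data and contact IBM support" ;;
        -2)
            echo "partial dump may be available: also check the error log for a minidump" ;;
        *)
            echo "no good dump: check the error log for a minidump" ;;
    esac
}
```

For example, next_step_for_dump_status -3 reports that there is no good dump and points to the error log check described above.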


Document Information

Modified date:
07 May 2024

UID

isg3T1023955