IBM Support

7063-CR1 HMC becomes unresponsive requiring a manual restart

Troubleshooting


Problem

The 7063-CR1 Hardware Management Console Hardware Appliance can become unresponsive and require a manual restart from either the BMC or through the power button.

Symptom

Symptoms include:
  • The HMC is unresponsive to ping, ssh, or https access
  • The HMC may become unresponsive after a predictable number of days (for example, every 25 days)

Cause

The 7063-CR1 HMC boot loader, Petitboot, provides two console settings: hvc0 (serial) and tty1 (VGA). The operating system may inherit this Petitboot setting, which gives the kernel one or more active consoles.
When the BMC restarts (which can occur during runtime) and the list of active consoles on the HMC contains the hvc0 (serial) console, the HMC is exposed to a condition that causes it to become unresponsive.

Environment

All 7063-CR1 HMCs with the hvc0 (serial) console set as an active console are exposed to the problem.

Diagnosing The Problem

This problem is mostly seen on 7063-CR1 HMCs updated to v9r2.952 or newer, or v10r1.1011 or newer. However, earlier versions can also experience the problem.
There are several ways to diagnose the problem, but most require analysis of logs by IBM Support personnel.
If this problem is suspected, run the following command to verify the current list of active consoles on the HMC:
cat /sys/class/tty/console/active
If the output contains hvc0 by itself or alongside another console, the HMC is exposed to the problem.
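
The check described above can be sketched as a small shell function. This is an illustration only: the function name has_hvc0 is hypothetical, and it assumes a shell in which functions can be defined, which the HMC restricted shell may not allow. In that case, apply the same rule by eye to the output of the cat command.

```shell
# Hypothetical helper: report whether hvc0 is among the active consoles.
# The optional argument is the file to read; it defaults to the sysfs
# path given in this document.
# Returns 0 if hvc0 is listed, 1 if it is not, 2 if the file is unreadable.
has_hvc0() {
    active_file="${1:-/sys/class/tty/console/active}"
    [ -r "$active_file" ] || return 2
    # The file holds a whitespace-separated list such as "tty1 hvc0".
    for console in $(cat "$active_file"); do
        if [ "$console" = "hvc0" ]; then
            return 0
        fi
    done
    return 1
}
```

Comparing whole console names, rather than searching for the substring, avoids matching a differently named console that merely begins with hvc0.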

Resolving The Problem

Fixes
The fix for this problem will be included in the following PTFs:
 
For V10 R2:
  • MF71694 - V10 R2 1042.2
For V10 R3:
  • MF71698 - V10 R3 1060.1 
Note: HMCs at V10 R3 105X should update to 1060.1 or newer.

Workarounds

To restart the HMC after it becomes unresponsive, use one of the following methods:
  1. Locally, press the power button until the HMC powers off, then power it back on
  2. Remotely, access the BMC web UI and restart the system
 
There are two mitigations:
 
Mitigation #1: Setting the boot console setting to tty1 in Petitboot
 
Mitigation #2: Rebooting the HMC before the point at which it normally becomes unresponsive
In some cases, the problem occurs in a predictable pattern. For example, in one observed pattern, the problem happens around every 25 days.
In other cases, the interval is shorter, but there is still an identifiable pattern.
Where a pattern can be identified, perform a preemptive restart of the HMC before the identified interval elapses.
For example, if the HMC experiences the problem approximately every 25 days, restart it around the 20-day mark.
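
The scheduling arithmetic can be sketched as follows. This is a hedged illustration intended for an administrative workstation with GNU date, not for the HMC itself. The function name next_restart_date, the sample date 2024-07-01, and the 5-day margin are assumptions chosen to match the 25-day example above.

```shell
# Hypothetical helper: print the date of the next preemptive restart.
# Arguments: last restart date (YYYY-MM-DD), observed failure interval
# in days, and a safety margin in days.
# Requires GNU date for the -d option.
next_restart_date() {
    last_restart="$1"
    interval_days="$2"
    margin_days="$3"
    # Restart this many days after the last restart.
    restart_after=$((interval_days - margin_days))
    date -u -d "$last_restart +$restart_after days" +%Y-%m-%d
}

# A 25-day pattern with a 5-day margin gives a restart at the 20-day mark.
next_restart_date 2024-07-01 25 5
```

The margin is a judgment call. A larger margin lowers the risk of reaching the failure point at the cost of more frequent restarts.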

Document Location

Worldwide


Document Information

Modified date:
19 July 2024

UID

ibm17157787