To short heartbeat sensitivity might lead to restarts on PureScale

Technical Blog Post

Abstract

Body

During testing of failures in redundant VIOS (being part of DB2 PureScale), we noticed that we might have unexpected member restart. It was surprising, as VIOS was redundant, and we expected that members will survive such failure. In order to troubleshoot we need DB2, TSA, RSCT and AIX level logs (most of those are collected via standard call to db2support -purescale). Once we obtained the logs, we narrowed it down up to event:

LABEL:        CONFIGRM_REBOOTOS_E
Date/Time:       Mon May 2 19:24:30 2016
Type:            PERM
Resource Name:   ConfigRM
Description
The operating system is being rebooted to ensure that critical resources are
stopped so that another sub-domain that has operational quorum may recover
these resources without causing corruption or conflict.
Detail Data
DETECTING MODULE
RSCT,PeerDomain.C,1.99.30.9,22945
ERROR ID

As we can see, problem originates on layers other than DB2 (ConfigRM). To move forward we looked from automation perspective - short before reboot an adapter error was reported:

LABEL: TS_LOC_DOWN_ST
Date/Time:       Mon May 2 19:24:30 2016
Node Id:         db2-mbr1a
Description      Possible malfunction on local adapter
Adapter IP address   10.20.46.13

So it seems that despite of VIOS the adapter was seen down and therefore there was a split brain and node was rebooted. As we can not tell a lot from TSA perspective, we have verified from OS perspective. From following lssrc -ls cthats output, only CG1 on en0 formed HB group between 4 nodes:

Subsystem Group PID Status

cthats           cthats           11665490 active
Network Name   Indx Defd Mbrs St   Adapter ID      Group ID
CG1            [ 0] 4     4     S    10.20.46.13     10.20.46.14
CG1            [ 0] en0              0x47279086      0x472790d7
HB Interval = 0.400 secs. Sensitivity = 4 missed beats
Ping Grace Period Interval = 30.000 secs.
Missed HBs: Total: 10 Current group: 10
Packets sent    : 4607543 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 12445664 ICMP 0 Dropped: 0
NIM's PID: 9371838
RoCE-CG51      [ 1] 4     1     S    10.20.51.13     10.20.51.13
RoCE-CG51      [ 1] hca0             0x87279084      0x872790d6
HB Interval = 0.050 secs. Sensitivity = 4 missed beats
Missed HBs: Total: 0 Current group: 0
Packets sent    : 0 ICMP 0 Errors: 0 No mbuf: 0
Packets received: 0 ICMP 0 Dropped: 0
NIM's PID: 8847670
RoCE-CG53      [ 2] 4     1     S    10.20.53.13     10.20.53.13
RoCE-CG53      [ 2] hca1             0x87279085      0x872790d7
HB Interval = 0.050 secs. Sensitivity = 4 missed beats

Following nim, nmDiag and cthats log showed since 19:37:51, en0 didn't received neighbor node HB packet, not able to ping 10.20.46.1 defined on netmon.cf file, nim declared en0 DOWN, at 19:38:02, it was able to ping 10.20.46.1 again, nim declared en0 UP. So there was 12 sec network outage caused nim declared en0 DOWN.

en0 nim log:
05/02 19:36:02.486: Heartbeat was NOT received. Missed HBs: 1. Limit: 4
05/02 19:37:51.608: Heartbeat was NOT received. Missed HBs: 1. Limit: 4
05/02 19:37:52.410: Heartbeat was NOT received. Missed HBs: 2. Limit: 4
05/02 19:37:52.410: Starting sending ICMP ECHOs.
05/02 19:37:52.410: Invoking netmon to find status of local adapter.
05/02 19:37:53.212: Heartbeat was NOT received. Missed HBs: 3. Limit: 4
05/02 19:37:53.626: netmon response: Adapter seems down
05/02 19:37:54.016: Heartbeat was NOT received. Missed HBs: 4. Limit: 4
05/02 19:37:54.016: Invoking netmon to find status of local adapter.
05/02 19:37:54.016: Local adapter is down: issuing notification for local adapter
05/02 19:37:54.016: Adapter status successfully sent.
05/02 19:37:54.047: Dispatching netmon request while another in progress.
05/02 19:37:54.047: Received a SEND MSG command. Dst: 10.20.46.14.
05/02 19:37:55.232: netmon response: Adapter seems down
05/02 19:37:55.232: Adapter status successfully sent.
05/02 19:38:02.099: netmon response: Adapter is up

en0 nmDiag log:
05/02 19:37:52.610: UpdatePstate:Called for en0.
05/02 19:37:52.610: nmPingOnly:Beginning: 1 successful pings required.
05/02 19:37:52.610: nmPingOnly:Sending to array indices [0-0]
05/02 19:37:52.610: nmSendPingOnly:Sending ECHO_REQUEST to 10.20.46.1
05/02 19:37:52.611: nmOpenRawSocket:Opened RAW INET socket for en0. fd = 14
05/02 19:37:52.681: nmPingOnly:0 IPv4 responses received so far.
05/02 19:37:52.714: nmPingOnly:Test failed: 0 pings received; no change to istate.
05/02 19:37:52.714: UpdatePstate:Stimulation 1 failed.
05/02 19:37:52.714: UpdatePstate:en0 istate was QUIET, NO activity, pstate:UNKNOWN.
05/02 19:37:52.714: nmAdapterGetStatus() returned -1
05/02 19:37:52.714: Attempt 0: adapter failed
..........................
05/02 19:38:01.953: UpdatePstate:Called for en0.
05/02 19:38:01.953: nmPingOnly:Beginning: 1 successful pings required.
05/02 19:38:01.953: nmPingOnly:Sending to array indices [0-0]
05/02 19:38:01.953: nmSendPingOnly:Sending ECHO_REQUEST to 10.20.46.1
05/02 19:38:02.059: nmRecvPingOnly:62-byte packet received from 10.20.46.1.
05/02 19:38:02.059: nmRecvPingOnly:Received ICMP_ECHOREPLY for 10.20.46.13 seq: 56 icmp_id 76 (0x4c).
05/02 19:38:02.059: nmPingOnly:1 IPv4 responses received so far.
05/02 19:38:02.099: nmPingOnly:Test succeeded: 1 pings received; Setting istate to UP.
05/02 19:38:02.099: nmCloseRawSocket:Closing RAW INET socket 14.
05/02 19:38:02.099: UpdatePstate:en0 istate was QUIET, Caused activity, pstate:UP.
05/02 19:38:02.099: nmAdapterGetStatus() returned 1
05/02 19:38:02.099: Attempt 2: adapter succeeded

cthats log:
05/02 19:37:54.016:hatsd[0]: Received adapter status for address
10.20.46.13. Adapter state: 3 (HA_NIM_ADAPTER_IS_DOWN).
05/02 19:37:54.016:hatsd[0]: Adapter is down.
05/02 19:37:54.016:hatsd[0]: Disabling adapter 10.20.46.13.
05/02 19:37:54.016:hatsd[0]: Adapter (10.20.46.13:0x47278dd3) is dead.
05/02 19:37:54.016:hatsd[0]: Notifying leader (10.20.46.14:0x46e7d03f) of death. FFDC id [---].
05/02 19:37:54.016:hatsd[0]: Sending adapter [Hb_Death] notifications.
05/02 19:37:54.016:hatsd[0]: Sending Notification packet for [allAdapter_13_0_mixed] subscription.
05/02 19:37:54.016:hatsd[0]: Reachable nodes (1 hop)    : 1

05/02 19:37:55.232:hatsd[0]: Received adapter status for address
10.20.46.13. Adapter state: 3 (HA_NIM_ADAPTER_IS_DOWN).
05/02 19:38:02.167:hatsd[0]: Received adapter status for address
10.20.46.13. Adapter state: 1 (HA_NIM_ADAPTER_IS_UP).

In order to fix this, we suggest customer relaxing Heartbeat Sensitivity settings. In this system, values of 4 (sensitivity) and 0.4 (period) allow for 3.2 seconds of network latency before RSCT decides that the heartbeat attempt between two nodes is unsuccessful and thus recover actions are necessary. We recommend leaving the Sensitivity at 4 and changing the Period to 4 which will allow for 32 seconds before RSCT declares a problem. Relaxing these settings can prevent unwanted behavior such as an unexpected reboot.

In order to do this, user has to issue the "lscomg" command to find out "CommGroup Name", then modify the settings to our recommended values, issue the following from any node:

export CT_MANAGEMENT_SCOPE=2
chcomg -s 4 -p 4 <CommGroup_Name>

Apply the change to all configured communication groups listed in the "lscomg" output.

[{"Business Unit":{"code":"BU058","label":"IBM Infrastructure w\/TPS"},"Product":{"code":"SSEPGG","label":"Db2 for Linux, UNIX and Windows"},"Component":"","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

UID

ibm11140100

Tips

To short heartbeat sensitivity might lead to restarts on PureScale

Technical Blog Post

Abstract

Body

UID

Share your feedback

Need support?