
Heartbeat (PDA cluster) not able to start when lower RPCs are not reachable

Troubleshooting


Problem

Heartbeat cannot start or be restarted if one of the appliance's lower RPCs is not reachable.

Symptom

You try to start Heartbeat, but it fails with a long list of error messages.

If the lines below are visible, there is an issue checking the SNMP response from the lower RPCs that control the STONITH function in the cluster. A manual SNMP query (see the sketch after the log excerpt) can confirm whether the RPC is responding.


heartbeat[22784]: 2015/07/29_19:51:24 info: glib: apcmastersnmp_set_config: Starting apcmastersnmp V1.1
heartbeat[22784]: 2015/07/29_19:51:24 info: glib: apcmastersnmp_set_config: Hostnames are 10.0.128.32 and 10.0.128.132.
heartbeat[22784]: 2015/07/29_19:51:30 ERROR: glib: APC_read: error sending/receiving pdu (cliberr: 0 / snmperr: -24 / error: Timeout).
heartbeat[22784]: 2015/07/29_19:51:30 ERROR: glib: NZ_set_config: cannot read number of outlets on 1st rpc.
heartbeat[22784]: 2015/07/29_19:51:30 ERROR: Unknown Stonith config error parsing [ 10.0.128.32 161 private] [2]
heartbeat[22784]: 2015/07/29_19:51:30 debug: glib: PILS: Looking for HBcomm/use_logd => [/usr/lib64/heartbeat/plugins/HBcomm/use_logd.so]
heartbeat[22784]: 2015/07/29_19:51:30 debug: glib: Plugin file /usr/lib64/heartbeat/plugins/HBcomm/use_logd.so does not exist
heartbeat[22784]: 2015/07/29_19:51:30 debug: glib: PILS: Looking for HBcomm/use_logd => [/usr/lib64/pils/plugins/HBcomm/use_logd.so]
heartbeat[22784]: 2015/07/29_19:51:30 debug: glib: Plugin file /usr/lib64/pils/plugins/HBcomm/use_logd.so does not exist
heartbeat[22784]: 2015/07/29_19:51:30 debug: glib: Plugin file /usr/lib64/pils/plugins/HBcomm/use_logd.so does not exist
heartbeat[22784]: 2015/07/29_19:51:30 info: Enabling logging daemon
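
To confirm the SNMP side independently of Heartbeat, query the first lower RPC directly. This is a minimal sketch, assuming the port 161 and community string "private" shown in the Stonith config line above and the standard net-snmp command-line client; adjust the address and community to match your configuration.

# Walk the standard system MIB group of the first lower RPC;
# a timeout here reproduces the APC_read error seen in the Heartbeat log.
snmpwalk -v1 -c private 10.0.128.32 system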

Cause


The RPC rpc1ll is unreachable, so the Heartbeat start process cannot read the SNMP response.

[root@NZ35172-H1 ~]# ping rpc1lr
PING rpc1lr (10.0.128.32) 56(84) bytes of data.
64 bytes from rpc1lr (10.0.128.32): icmp_seq=1 ttl=255 time=2.54 ms
64 bytes from rpc1lr (10.0.128.32): icmp_seq=2 ttl=255 time=3.32 ms
64 bytes from rpc1lr (10.0.128.32): icmp_seq=3 ttl=255 time=3.28 ms

64 bytes from rpc1lr (10.0.128.32): icmp_seq=4 ttl=255 time=3.30 ms
64 bytes from rpc1lr (10.0.128.32): icmp_seq=5 ttl=255 time=3.29 ms
^C
--- rpc1lr ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4454ms
rtt min/avg/max/mdev = 2.540/3.149/3.322/0.311 ms
[root@NZ35172-H1 ~]# ping rpc1ll
PING rpc1ll (10.0.128.132) 56(84) bytes of data.
^C
--- rpc1ll ping statistics ---
12 packets transmitted, 0 received, 100% packet loss, time 11867ms

Diagnosing The Problem

In case of a single rack system:

Ping both rpc1lr and rpc1ll.

In case of a multirack system:

Ping the rack 1 and rack 2 lower RPCs (rpc1lr, rpc1ll, rpc2lr, rpc2ll), for example with the loop sketched below.
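
A minimal sketch of the ping check, assuming the RPC hostnames resolve on the host as in the Cause section; drop rpc2lr and rpc2ll for a single-rack system:

# Send three ICMP echo requests to each lower RPC and flag the unreachable ones
for rpc in rpc1lr rpc1ll rpc2lr rpc2ll; do
    ping -c 3 -W 2 "$rpc" > /dev/null && echo "$rpc reachable" || echo "$rpc NOT reachable"
done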

Also run the following tool to diagnose possible management network issues:


/nz/kit/bin/adm/tools/nznetw

Resolving The Problem

In this case, nznetw showed that the management switch port to rpc1ll was down:


******************************
Querying management switch(es)
******************************

Link Status for Management Switch 1

Port 1 : To rack 1 lower left RPC down [FAIL]
Port 2 : To rack 1 upper left RPC up [PASS]
Port 3 : To HA1 LOM3 (eth0) up [PASS]

Try to enable the port using the steps in
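
Once the port has been re-enabled, a quick verification pass can confirm the fix before starting the cluster again. A minimal sketch, assuming the standard Heartbeat init script on the host (verify the correct cluster start procedure for your appliance first):

# Confirm the management switch now reports the rack 1 lower left RPC port as up
/nz/kit/bin/adm/tools/nznetw

# Confirm the RPC answers ping, then start Heartbeat
ping -c 3 rpc1ll
service heartbeat start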

[{"Product":{"code":"SSULQD","label":"IBM PureData System"},"Business Unit":{"code":"BU053","label":"Cloud & Data Platform"},"Component":"Cluster","Platform":[{"code":"PF025","label":"Platform Independent"}],"Version":"1.0.0","Edition":"","Line of Business":{"code":"LOB10","label":"Data and AI"}}]

Document Information

Modified date:
17 October 2019

UID

swg21963833