IBM Support

Troubleshooting WebSphere Edge Load Balancer High Availability failover

Troubleshooting


Problem

This document provides information to help resolve high availability problems with the IBM WebSphere Application Server Edge Load Balancer. It familiarizes you with the troubleshooting process and saves time.

Cause

A high availability failover occurs if:
  1. The backup Load Balancer fails to receive a heartbeat packet from the active Load Balancer.
  2. The backup Load Balancer can communicate with more of the defined reach targets than the active Load Balancer.
  3. Auto recovery is defined and the primary Load Balancer becomes active.
  4. A manual takeover is issued.

There are a number of reasons why a heartbeat fails. Failures occur from network problems, hardware problems, system performance problems, and so on.

The Load Balancer heartbeats are UDP packets, sent and received on the port defined for high availability in the Load Balancer configuration. Both Load Balancers must define the same port number in the high availability configuration.

Diagnosing The Problem

Network traces collected on both Load Balancers before a takeover always identify the cause of the takeover. Since the time of the next takeover is unknown, obtaining valid network traces can be difficult. Review "Using packet trace tools iptrace, snoop, tcpdump, wireshark, and nettl" for network trace collection information.

Note: The Load Balancer for IPv4 and IPv6 on AIX® requires extra configuration steps to successfully collect a network trace while the Load Balancer is running.

Network traces continue to grow, and a failover might not occur for some time. If capturing the packets during a failover is feasible, the trace lets you identify delayed or lost packets and reach target failures.

If a failover occurred, the following command might offer some information if the problem was due to an inability to transmit a heartbeat:
netstat -m

Note: If the output reports failures, you might be able to resolve the issue by tuning the available memory.

Resolving The Problem

If you contacted IBM Support, continue on to the Load Balancer specific diagnostic information. Otherwise, click: MustGather: IBM WebSphere Edge Load Balancer (Dispatcher component).

Load Balancer High Availability Failover specific diagnostic information
Options for debugging takeovers:

1) Set the Load Balancer log level to 5 and increase the log size. The log indicates whether a failover occurred from reachability problems.
dscontrol set loglevel 5
dscontrol set logsize 50000000
dscontrol man reach set loglevel 5
dscontrol man reach set logsize 50000000

2) Add more logging in the go scripts (goActive and goStandby). When using the Load Balancer's high availability feature on Linux (not Linux on System z) or Windows operating systems, do not alias addresses in the go scripts. The user exit is available on all platforms and can be used to gather data when a takeover occurs.

Add the following lines to the end of the goActive and goStandby scripts.


goActive:
echo "After goActive: " >> ../logs/dispatcher/goActive.log
ifconfig -a >> ../logs/dispatcher/goActive.log
netstat -an >> ../logs/dispatcher/goActive.log
netstat -nr >> ../logs/dispatcher/goActive.log
arp -a >> ../logs/dispatcher/goActive.log

goStandby:
echo "After goStandby: " >> ../logs/dispatcher/goStandby.log
ifconfig -a >> ../logs/dispatcher/goStandby.log
netstat -an >> ../logs/dispatcher/goStandby.log
netstat -nr >> ../logs/dispatcher/goStandby.log
arp -a >> ../logs/dispatcher/goStandby.log

If addresses are aliased in the go script, put debugging statements before and after the aliasing commands in the scripts. It is also helpful to record the date and time at the start and end of the scripts.
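The date and time markers described above can be sketched as a small helper at the top of a go script. The function name log_marker and the LOG variable are hypothetical; the default log path is the same relative path used in the goActive lines above.

```shell
#!/bin/sh
# Hypothetical helper for goActive; for goStandby, change the log name and label.
LOG="${LOG:-../logs/dispatcher/goActive.log}"

log_marker() {
    # $1 is a free-form label such as "start", "before alias", or "end"
    echo "$(date '+%Y-%m-%d %H:%M:%S') goActive: $1" >> "$LOG"
}

# Usage inside the go script:
#   log_marker "start"
#   ... existing commands ...
#   log_marker "end"
```

Each call appends one timestamped line, so the interval between the "start" and "end" markers shows how long the script took to run during a takeover.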

3) Gather network traces on both Load Balancers. As mentioned previously, you might need to periodically stop and restart the network trace, otherwise it continues to grow. Monitor the tracing during working hours and try to stop it as soon as a failover occurs.
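One way to keep a long-running trace from growing without limit is a rotating capture. The following is a minimal sketch that assumes tcpdump is the trace tool in use; the interface name, port number, and output path are placeholders, not values from any Load Balancer configuration. The icmp term is included on the assumption that reach target checks are pings. The script only prints the command so that it can be reviewed before it is run.

```shell
#!/bin/sh
# Placeholders (assumptions): replace with the real interface, the high
# availability port from the Load Balancer configuration, and a suitable path.
IFACE="${IFACE:-en0}"
HA_PORT="${HA_PORT:-12345}"
OUT="${OUT:-/tmp/lbtrace.pcap}"

build_capture_cmd() {
    # -C 100 : start a new file after about 100 million bytes
    # -W 10  : keep at most 10 files, overwriting the oldest (ring buffer)
    # -n     : do not resolve addresses to names
    echo "tcpdump -i $IFACE -n -C 100 -W 10 -w $OUT udp port $HA_PORT or icmp"
}

build_capture_cmd
```

With these values the capture is bounded at roughly 1 GB on disk. It still needs to be stopped soon after a failover, before the files that cover the event are overwritten.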

Note: The network trace can show the breaks in the communication (for example: periodic slowdowns with gaps of more than 5 seconds). Depending on what the network trace shows, the answer is either to increase the high availability timeout to tolerate the gaps or to debug the performance issues (OS or network) that cause the timeouts. Typical causes are:
  • Extreme network loads degrading performance on the network interface cards.
  • Large firewall filter rule sets delaying the heartbeat packet arrival to the Load Balancer.
  • Network congestion delaying or dropping packets.
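The gaps mentioned in the note can be located mechanically once the trace is available as text. The following sketch assumes each input line begins with an epoch timestamp, which is the format that tcpdump -tt produces; the function name find_gaps is hypothetical.

```shell
#!/bin/sh
# find_gaps THRESHOLD
# Reads lines whose first field is an epoch timestamp and prints every
# interval between consecutive lines that is longer than THRESHOLD seconds.
find_gaps() {
    awk -v limit="$1" '
        NR > 1 && ($1 - prev) > limit {
            printf "gap of %.1f s ending at %s\n", $1 - prev, $1
        }
        { prev = $1 }
    '
}

# Usage (file name and port are placeholders):
#   tcpdump -tt -n -r lbtrace.pcap udp port 12345 | find_gaps 5
```

Because heartbeats travel in both directions, restricting the input to one sender (for example by adding "src host" and the partner address to the tcpdump filter) gives a clearer picture of which side stopped transmitting.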
From the iptrace, you might be able to get an idea of what is happening. The following command might also show adapter errors (dropped packets, and so on):
netstat -v
Once the failover occurs, you can stop the trace and reset the logging.

kill <pid>  (where <pid> is the process ID of the iptrace)
dscontrol set loglevel 1
dscontrol man reach set loglevel 1

NTP (Network Time Protocol) can cause unexpected takeovers. Heartbeats are expected every 0.5 seconds and a large change in the clock can trigger detection that the partner is not available. Time updates must be limited to less than 0.5 seconds per update.
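As a worked example of the 0.5 second limit, the following sketch checks whether a pending clock correction is small enough to apply in one update. The function name step_is_safe is hypothetical, and how the offset is obtained (in seconds, positive or negative) is left as an assumption because it depends on the time service in use.

```shell
#!/bin/sh
# step_is_safe OFFSET_SECONDS
# Succeeds (exit 0) when the absolute offset is below the 0.5 second
# heartbeat interval, and fails (exit 1) otherwise.
step_is_safe() {
    awk -v off="$1" 'BEGIN { if (off < 0) off = -off; exit !(off < 0.5) }'
}

# Usage:
#   if step_is_safe 0.2; then echo "ok to apply"; else echo "too large"; fi
```

An offset of exactly 0.5 is treated as too large, matching the requirement that updates stay below 0.5 seconds.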

Another item to check is PMTU discovery, to see whether its table contains many routes. For example, on AIX, the OS must search through this list of routes for each packet to find the appropriate one. Any packet that needs the default route causes a search of all of the routes, which adds computation on a per-packet basis.

To turn it off:
no -o tcp_pmtu_discover=0
no -o udp_pmtu_discover=0
Edit the rc.tcpip file, located in the /etc directory, and add the preceding two commands to the end of that file so the change persists.

Collect the following information from both machines while the problem is occurring:
  • server.log (collected with loglevel 5)
  • goActive.log
  • goStandby.log
  • configuration files
  • highavailchange script (if applicable)
  • goActive and goStandby scripts
  • network trace in binary format
  • netstat -m
  • netstat -v
Follow instructions to send diagnostic information to IBM support.

For a listing of all technotes, downloads, and educational materials specific to the Load Balancer component, search the WebSphere Application Server support site.


Document Information

Modified date:
21 April 2022

UID

swg21218379