IBM Support

QRadar: High Availability (HA) failover occurred due to a failed ping test

Troubleshooting


Problem

How do you recover from a High Availability (HA) failover due to a failed ping test?

Symptom

The QRadar HA primary host (typically the active host) suddenly fails over to the secondary host.
 
Note: You can review the status of appliances from the user interface. To view the appliance status, select QRadar Console UI > Admin tab > System and Licensing Management. If a failover has occurred, the primary or previously active appliance displays Offline or Standby and the secondary displays Active in the Status column.

Cause

One potential cause of a QRadar Console HA failover is that the primary host failed the network connectivity tests.
 
Network connectivity tests

Order of operations for the network connectivity tests on HA appliances:

  • Every 30 seconds (default), the primary node loops through the list of ping-hosts.
  • For each host, a 5-packet ping sweep is executed.
  • If the ping sweep to a host fails, alert mode is engaged and the primary node tells the standby node to ping all ping-hosts.
  • If the full ping sweep from the standby node is 100% successful, the primary node pings the host that it failed to reach one more time.
  • If that ping sweep is successful, no failover occurs and the primary node continues to check the ping-hosts one by one.
  • If that ping sweep fails, the primary node starts a failover.
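
The order of operations above can be read as a small decision procedure. The following shell sketch is illustrative only and is not QRadar's implementation: the function names sweep and decide_failover are hypothetical, and the pass criterion for a sweep (at least one reply) is an assumption. The source also does not say what happens when the standby node's sweep is not 100% successful, so the sketch reports that case as unspecified rather than guessing.

```shell
# Illustrative sketch only; not QRadar code. All names are hypothetical.

# sweep HOST: send a 5-packet ping sweep to HOST.
# Assumption: the sweep passes when ping exits 0 (at least one reply received).
sweep() {
    ping -c 5 -q "$1" >/dev/null 2>&1
}

# decide_failover FIRST_OK STANDBY_ALL_OK RETRY_OK
# Each argument is "yes" or "no":
#   FIRST_OK        primary's regular sweep to the host passed
#   STANDBY_ALL_OK  standby's sweep of all ping-hosts was 100% successful
#   RETRY_OK        primary's repeat sweep to the failed host passed
# Prints one of: continue, failover, unspecified
decide_failover() {
    local first_ok="$1" standby_all_ok="$2" retry_ok="$3"

    # Regular sweep passed: no alert mode, keep checking the next host.
    if [ "$first_ok" = "yes" ]; then
        echo "continue"
        return
    fi

    # Alert mode: the standby node has been asked to ping all ping-hosts.
    # The source does not describe the outcome when that sweep is not
    # 100% successful, so it is reported as unspecified here.
    if [ "$standby_all_ok" != "yes" ]; then
        echo "unspecified"
        return
    fi

    # Standby reached every ping-host, so the primary retries the failed host.
    if [ "$retry_ok" = "yes" ]; then
        echo "continue"
    else
        echo "failover"
    fi
}
```

For example, decide_failover no yes no prints failover, while decide_failover no yes yes prints continue.
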

Other scenarios that can cause the active appliance to fail over
  • A power supply failure.
  • A network failure that is detected by network connectivity tests.
  • An operating system malfunction that delays or stops the heartbeat ping tests.
  • A complete Redundant Array of Independent Disks (RAID) failure on the primary HA host.
  • A manual failover.

The following scenarios do not cause an automatic HA failover
  • If a QRadar process develops an error, stops functioning, or exits with an error.
  • If a disk on your primary HA host reaches 95% capacity, QRadar data collection stops, but the primary HA host continues to function.

Diagnosing The Problem

To determine whether a QRadar Console HA failover was caused by a failed ping test, review the logs for messages that indicate this failure. In the /var/log/qradar.log file, look for HAManager service information messages similar to the following:
 
[ha_manager.ha_manager] [HAManager] com.q1labs.ha.manager.StateMachine: [INFO] [NOT:0000006000][<IP_Address>/- -] [-/- -]Triggering ping_status based failover: local ping_status 0.67
[ha_manager.ha_manager] [HAManager] com.q1labs.ha.manager.StateMachine: [INFO] [NOT:0000006000][<IP_Address>/- -] [-/- -]Sensor ping_status based failover started giveback (SMD031002/14200161)
[ha_manager.ha_manager] [HAManager] com.q1labs.ha.manager.HAManager: [INFO] [NOT:0000006000][<IP_Address>/- -] [-/- -]Starting giveback: sending local OFFLINE
Note: The messages above confirm that the failover was triggered by the ping test, but they do not identify which managed host failed the network connectivity test.
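
As a convenience, the log can be searched for these messages with grep. The following is a minimal sketch that assumes the message text matches the samples above; the function name find_ping_failover is hypothetical, and the search patterns are taken only from the sample lines, so other releases might word the messages differently.

```shell
# find_ping_failover [LOGFILE]
# Print HAManager messages that indicate a ping_status based failover.
# LOGFILE defaults to /var/log/qradar.log (the path named in this document).
# The patterns are copied from the sample messages; the function name is
# hypothetical and not part of QRadar.
find_ping_failover() {
    local log="${1:-/var/log/qradar.log}"
    grep -E 'ping_status based failover|Starting giveback' "$log"
}
```

Run find_ping_failover with no arguments on the host that failed over, or pass the path of a copied log file as the first argument.
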

 

Resolving The Problem

  1. Identify which managed hosts are unable to communicate with the primary HA host. You can review the list of managed hosts in QRadar Console UI > Admin tab > System and Licensing Management. If a managed host in the list has been deprecated, you can remove it. For more information, see: Removing a managed host.
    1. Alternatively, you can run the following command to show which hosts the primary HA Console runs the network connectivity test against.
      Example: grep PING /opt/qradar/ha/ha.conf
    2. The output is similar to:
      PING_NODES=x.x.x.x, x.x.x.x, x.x.x.x
  2. Check for managed hosts that might have network connectivity issues, and work with your network team to resolve them. When all the managed hosts communicate properly with the primary HA host, you can fail back to the primary Console.
  3. Go to QRadar Console UI > Admin tab > System and Licensing Management and verify that the primary HA host is in Standby status.
  4. When communications to the primary HA Console are restored, it does not automatically take over the active status in the HA cluster. Instead, the secondary HA host remains the active system, and the primary host acts as the standby system.
Important: You must switch the primary back to the active status after successfully recovering from a primary failure. For more information, see: Switching a primary HA host to active.
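
To help with steps 1 and 2, the hosts in PING_NODES can be extracted and pinged one by one from the primary HA host. This is a minimal sketch that assumes the PING_NODES line is a comma-separated list in the form shown in step 1; the function names list_ping_nodes and check_ping_nodes are hypothetical, and the 5-packet count simply mirrors the sweep size described in the Cause section.

```shell
# list_ping_nodes [CONF]
# Print one ping-host per line from the PING_NODES entry.
# CONF defaults to /opt/qradar/ha/ha.conf (the path named in this document).
# Assumption: the value is a comma-separated list, optionally with spaces.
list_ping_nodes() {
    local conf="${1:-/opt/qradar/ha/ha.conf}"
    grep '^PING_NODES=' "$conf" \
        | cut -d= -f2- \
        | tr ',' '\n' \
        | sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
        | grep -v '^$'
}

# check_ping_nodes [CONF]
# Ping each listed host with 5 packets and print OK or FAIL per host.
check_ping_nodes() {
    list_ping_nodes "$@" | while read -r host; do
        # </dev/null keeps ping from consuming the host list on stdin.
        if ping -c 5 -q "$host" >/dev/null 2>&1 </dev/null; then
            echo "OK   $host"
        else
            echo "FAIL $host"
        fi
    done
}
```

Running check_ping_nodes on the primary HA host prints one line per ping-host. Hosts reported as FAIL are candidates to investigate with your network team.
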

Document Location

Worldwide

Document Information

Modified date:
03 June 2021

UID

ibm10874926