IBM Support

PH13808: SPLIT-BRAIN HADR FAILURE MAY OCCUR WITH AUTOMATED FAILOVER


APAR status

  • Closed as fixed if next.

Error description

  • Split-brain failures occur when both head nodes are active and
    accepting transactions that cannot later be reconciled because
    their log chains have diverged and are incompatible. This may
    happen when a transaction on the Primary has not yet been
    replicated to the Standby before a failover occurs,
    particularly if the failover is performed by force. When both
    HA and Failover Automation are enabled, the Failover Controller
    may issue a "by force" takeover command, which may be
    vulnerable to split-brain HADR failure.

    To reduce the likelihood of split-brain, Failover Automation
    now uses "by force peer window only", to allow time for the
    Standby to catch up to the Primary. In addition, this APAR
    addresses an issue that may have allowed the Primary head node
    to shut down before the Standby could take over, which would
    cause the Standby to use "by force" unnecessarily.

    It is not possible to completely eliminate the possibility of
    split-brain; a split-brain failure could still occur if the
    Primary head node goes down unexpectedly and remains down for
    longer than the HADR peer window.
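
The takeover forms discussed above can be compared side by side. The
following is a minimal sketch, not the Failover Controller's actual
logic: it only prints the Db2 TAKEOVER HADR command for a chosen mode
so the operator can review it before running it on the Standby head
node. The function name takeover_cmd and the mode labels (graceful,
peer-window, force) are hypothetical; the database alias bigsql is
assumed from the Local fix command in this APAR.

```shell
#!/bin/sh
# Sketch only: print the Db2 TAKEOVER HADR command for a given mode.
# takeover_cmd and the mode labels are hypothetical helper names;
# the default alias "bigsql" is assumed from this APAR's Local fix.
takeover_cmd() {
    mode=$1
    db=${2:-bigsql}
    case "$mode" in
        # Normal role switch; requires the Primary to be reachable.
        graceful)    echo "db2 takeover hadr on db $db" ;;
        # Forced takeover that is only permitted inside the peer window.
        peer-window) echo "db2 takeover hadr on db $db by force peer window only" ;;
        # Unconditional forced takeover; the form most exposed to split-brain.
        force)       echo "db2 takeover hadr on db $db by force" ;;
        *)           echo "unknown mode: $mode" >&2; return 1 ;;
    esac
}
```

For example, takeover_cmd peer-window prints the "by force peer window
only" form that Failover Automation uses after this fix. Printing the
command rather than executing it keeps the sketch safe to run on a
machine without a Db2 environment.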
    

Local fix

  • If ssh latency is an issue in your cluster, you can increase
    the value of HADR_PEER_WINDOW with the following command,
    followed by a restart of the head node:

    db2 update db cfg for bigsql dbpartitionnum 0 using hadr_peer_window <new value>

    The appropriate new value depends on the specifics of the
    cluster and its latency.
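
The update can be wrapped in a small check so that a malformed value is
caught before it reaches Db2. This is a minimal sketch under stated
assumptions: the function name peer_window_cmd is hypothetical, the
value is treated as a whole number of seconds (the unit of
hadr_peer_window), and 300 in the usage comment is an illustrative
placeholder, not a recommended setting.

```shell
#!/bin/sh
# Sketch only: validate a proposed peer window (in seconds) and print
# the update command from this APAR. peer_window_cmd is a hypothetical
# helper name; it does not contact Db2.
peer_window_cmd() {
    value=$1
    case "$value" in
        # Reject empty input and anything containing a non-digit.
        ''|*[!0-9]*)
            echo "peer window must be a non-negative integer (seconds)" >&2
            return 1 ;;
    esac
    echo "db2 update db cfg for bigsql dbpartitionnum 0 using hadr_peer_window $value"
}

# To see the current setting first (requires a Db2 environment):
#   db2 get db cfg for bigsql | grep -i hadr_peer_window
# Example with a placeholder value:
#   peer_window_cmd 300
```

The printed command still has to be run as the instance owner and
followed by a restart of the head node, as described above.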
    

Problem summary

  • Please see the error description above.
    

Problem conclusion

Temporary fix

Comments

APAR Information

  • APAR number

    PH13808

  • Reported component name

    IBM BIG SQL

  • Reported component ID

    5737E7400

  • Reported release

    504

  • Status

    CLOSED FIN

  • PE

    NoPE

  • HIPER

    NoHIPER

  • Special Attention

    NoSpecatt / Xsystem

  • Submitted date

    2019-06-27

  • Closed date

    2020-09-09

  • Last modified date

    2020-09-09

  • APAR is sysrouted FROM one or more of the following:

  • APAR is sysrouted TO one or more of the following:

Fix information

Applicable component levels

  • Product: IBM Db2 Big SQL
  • Version: 504
  • Platform: Platform Independent
  • Line of Business: Data and AI
  • Business Unit: IBM Software w/o TPS

Document Information

Modified date:
10 September 2020