IBM PureData System for Analytics, Version 7.1

Detecting split-brain

About this task

Split-brain is an error state that occurs when the images of data on each IBM® Netezza® host are different. It typically occurs when synchronization is disabled and users change data independently on each Netezza host. As a result, the two host images diverge, and it becomes difficult to determine which image is the latest, correct one.

Important: Split-brain does not occur if clustering is enabled. The fencing controls prevent users from changing the replicated data on the standby node. Let Heartbeat control DRBD management to avoid split-brain problems.
However, if a split-brain problem occurs, the following message is written to the /var/log/messages file:
Split-Brain detected, dropping connection!
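
As a minimal sketch, the log check can be wrapped in a small shell function. The function name find_split_brain is hypothetical, and the search string is the leading part of the message shown above. The function assumes that the messages file is readable by the current user (typically root).

```shell
# Search a syslog file for the DRBD split-brain message.
# Defaults to /var/log/messages; pass another path as the first
# argument to check a different file.
find_split_brain() {
    log_file="${1:-/var/log/messages}"
    # -n prefixes each matching line with its line number in the file.
    grep -n "Split-Brain detected" "$log_file"
}
```

Running find_split_brain with no arguments prints any matching lines from /var/log/messages. The function returns a nonzero status if no match is found.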

While DRBD does have automatic correction processes to resolve split-brain situations, the Netezza implementation disables the automatic correction. Manual intervention is required, which is the best way to ensure that as many of the data changes are restored as possible.

To detect and repair split-brain, work with Netezza Support to follow this procedure:

Procedure

  1. Look for Split in /var/log/messages, usually on the host that you are trying to make the primary/active host. Let DRBD detect this condition.
  2. Because split-brain results from running both images as primary Netezza hosts without synchronization, check the Netezza logs on both hosts. For example, check the pg.log files on both hosts to see whether and when updates occurred. If the update times overlap, the two images contain different information.
  3. Identify which host image, if either, is the correct image. In some cases, neither host image might be fully correct, and you must choose the image that is more nearly correct. The host with the image that you choose is the “survivor”, and the other host is the “victim”.
  4. Perform the following procedure:
    1. Log in to the victim host as root and run these commands:
      drbdadm secondary resource
      drbdadm disconnect resource
      drbdadm -- --discard-my-data connect resource
      where resource can be r0, r1, or all.

      Complete these steps for one resource at a time; that is, run all the commands in substeps 1 and 2 of this step for r0, and then repeat them all for r1. There is an all option, but use it carefully. The individual resource commands usually work more effectively.

    2. Log in to the survivor host as root and run this command:
      drbdadm connect resource
      where resource can be r0, r1, or all.
      Note: The connect command might display an error that instructs you to run drbdadm disconnect first.
  5. Check the status of the fix by using drbdadm primary resource and the service drbd status command. Make sure that you run drbdadm secondary resource before you start Heartbeat.
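
The victim and survivor commands in step 4 can be sketched as two shell functions, shown here as an illustration only; continue to work with Netezza Support before you run anything on a live system. The function names victim_discard and survivor_connect and the DRBDADM variable are hypothetical helpers and are not part of DRBD or the Netezza software. The drbdadm subcommands themselves are the ones listed in the procedure.

```shell
# Hypothetical override so the sequence can be dry-run,
# for example with DRBDADM=echo.
DRBDADM="${DRBDADM:-drbdadm}"

# Run on the victim host as root. Demotes the resource, disconnects it,
# and reconnects while discarding the victim's local changes.
# Stops at the first command that fails.
victim_discard() {
    res="$1"
    $DRBDADM secondary "$res" &&
    $DRBDADM disconnect "$res" &&
    $DRBDADM -- --discard-my-data connect "$res"
}

# Run on the survivor host as root after the victim commands complete.
survivor_connect() {
    res="$1"
    $DRBDADM connect "$res"
}
```

For example, run victim_discard r0 on the victim and then survivor_connect r0 on the survivor, and repeat both for r1. Setting DRBDADM=echo first prints the commands without running them, which allows the sequence to be reviewed beforehand.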


Copyright IBM Corporation 2014 | Last updated: 2014-02-28