IBM Support

Setting up two-node Db2 HADR Pacemaker cluster with fencing on Microsoft Azure

How To


Summary

This document provides an alternate Db2 HADR configuration with Pacemaker that does not use a third lightweight host as a quorum device (Qdevice) arbitrator. It discusses the pros and cons of the fencing-only setup so that you can choose between the two configurations based on the tradeoff between cost and recovery time.

Objective

The objective of this document is to detail an alternative to the best-practice Pacemaker solution of two-node HADR plus a quorum device, which is described in the IBM Documentation here: Quorum devices support on Pacemaker - IBM Documentation

On Microsoft Azure, you do not necessarily need to configure a quorum device on a third host.  Instead, you can configure fencing as described in this document. 

The advantage of configuring a two-node HADR Pacemaker cluster with fencing is that it removes the requirement of a third host for the quorum device, thus reducing on-going cost. 

The disadvantage is a longer recovery time from a primary host failure, because of the added time it takes to successfully fence the failed host from the cluster. Based on our internal test results in a controlled environment, recovery from a primary host failure can take up to 6 times longer with fencing than with a quorum device host. To compensate for this, the HADR_PEER_WINDOW value of all databases must be set to at least 300 seconds.

The choice of configuration should be based on your specific business requirements, taking recovery time and cost of implementation into account. Fencing on Microsoft Azure is done via fence_azure_arm, a fencing agent for Azure Resource Manager that uses the Azure SDK for Python to connect to Azure.

Db2 does not include the fencing agent for Azure in the Db2 installation image. You must download the package for the Azure fencing agent from the following website:

https://www-01.ibm.com/marketing/iwm/platform/mrs/assets?source=mrs-db2pcmk&_ga=2.31425788.296289340.1604345966-344484498.1579133947

To install the fencing agent, perform the following steps:

1. Download the latest version of the Azure fencing agent, for example Db2_Azure_fence_agent_4.7.1-3_noarch.tar.gz, from the website above.

2. Unpack the archive by using the following command: tar -zxf Db2_Azure_fence_agent_4.7.1-3_noarch.tar.gz

The above creates the directory Db2_Azure_fence_agent_4.7.1-3_noarch.

3. Install the rpm packages

Change to the directory created above, then to the subdirectory named after your operating system identifier, and issue the following command:

For SLES:  zypper install --allow-unsigned-rpm *.rpm

For RHEL:  dnf install *.rpm

Note:

The fencing agents must be installed on both nodes in the cluster.
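
The choice between the two install commands in step 3 can be scripted. The following is a minimal sketch, assuming the distribution is identified by the ID field of /etc/os-release (values such as sles or rhel). The function name install_cmd_for is hypothetical, and the function only prints the command instead of running it.

```shell
# Print the rpm install command from step 3 for a given distribution ID.
# Assumption: the ID follows /etc/os-release conventions ("sles", "rhel").
install_cmd_for() {
    case "$1" in
        sles*) echo 'zypper install --allow-unsigned-rpm *.rpm' ;;
        rhel*) echo 'dnf install *.rpm' ;;
        *)     echo "unsupported distribution: $1" >&2; return 1 ;;
    esac
}
```

For example, after sourcing /etc/os-release on a node, install_cmd_for "$ID" prints the command to run in the unpacked directory. Run it on both nodes.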

Once installed, follow the steps below to configure a two host HADR Pacemaker cluster with fencing on Microsoft Azure.

Environment

Refer to the following IBM Documentation page for a list of platforms supported by Pacemaker. The same restrictions apply here: Restrictions on Pacemaker - IBM Documentation

Steps

1. Refer to the “Configuring a clustered environment using the Db2 cluster manager (db2cm) utility” page of the IBM Documentation to deploy the automated HADR solution: Configuring a clustered environment using the Db2 cluster manager (db2cm) utility - IBM Documentation

2. Create the Azure fence agent STONITH device, which uses a service principal to authorize against Microsoft Azure. Follow these steps to create the service principal.

  • Go to https://portal.azure.com
  • Open “Azure Active Directory” on the right menu.
    Go to Properties and make a note of the Directory ID. This is the tenant ID.
  • Click “App registrations”
  • Click “New registration”
  • Enter a Name, for example “PCMK1”
  • Select "Accounts in this organization directory only"
  • Select Application Type "Web", enter a sign-on URL (for example http://localhost) and click Add
    The sign-on URL is not used for the Pacemaker setup and can be any valid URL like http://localhost.
  • Select “Certificates and secrets”, then click “New client secret”
  • Enter a description for a new key, select "Never expires" and click “Add”
  • Make a note of the Value. It is used as the password for the service principal
  • Select “Overview”. Make a note of the Application ID. It is used as the username (login ID in the steps below) of the service principal

For details, refer to the following tutorial: https://docs.microsoft.com/en-us/azure/active-directory-b2c/tutorial-register-applications?tabs=app-reg-ga

Create a custom role for the fence agent

The service principal does not have permissions to access your Azure resources by default. You need to give the service principal permissions to start and stop (power-off) all virtual machines of the cluster. If you did not already create the custom role, you can create it using PowerShell or Azure CLI on one of the machines in the cluster.

Use the following content for the input file. Adapt the content to your subscriptions, that is, replace 12345678-9abc-def1-2345-6789abcdef12 and 87654321-cba9c-1fed-5432-21fedcba9876 with the IDs of your subscriptions. If you only have one subscription, remove the second entry in “assignableScopes”.

{
    "properties": {
        "roleName": "Linux Fence Agent Role",
        "description": "Allows to power-off and start virtual machines",
        "assignableScopes": [
            "/subscriptions/12345678-9abc-def1-2345-6789abcdef12",
            "/subscriptions/87654321-cba9c-1fed-5432-21fedcba9876"
        ],
        "permissions": [
            {
                "actions": [
                    "Microsoft.Compute/*/read",
                    "Microsoft.Compute/virtualMachines/powerOff/action",
                    "Microsoft.Compute/virtualMachines/start/action"
                ],
                "notActions": [],
                "dataActions": [],
                "notDataActions": []
            }
        ]
    }
}
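
Before creating the role, it can help to confirm that the placeholder subscription IDs were actually replaced. The following is a minimal sketch, assuming a subscription ID has the usual 8-4-4-4-12 hexadecimal GUID layout. The function name is_guid is hypothetical. Note that the second placeholder above, 87654321-cba9c-1fed-5432-21fedcba9876, does not have that layout, so the check rejects it.

```shell
# Return 0 if the argument has the 8-4-4-4-12 hexadecimal layout of a GUID.
is_guid() {
    printf '%s\n' "$1" |
        grep -Eq '^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$'
}
```

A call such as is_guid "$subscription_id" || echo "check the subscription ID" can be placed in front of whatever tool you use to create the role.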

Assign the custom role to the service principal

Assign the custom role "Linux Fence Agent Role" that was created in the last section to the service principal on both nodes in the cluster. Do not use the Owner role anymore!

  • Go to https://portal.azure.com
  • Open All resources in the menu on the left.
  • Select the virtual machine of the first cluster node
  • Click Access control (IAM)
  • Click Add role assignment
  • Select the role "Linux Fence Agent Role"
  • Enter the name of the application you created above, for example “PCMK1”
  • Click Save

Repeat the steps above for the second cluster node.
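
The portal steps above can also be approximated on the command line. The following is a sketch only, assuming the Azure CLI (az) is installed and logged in, and that scoping the assignment to the individual virtual machine matches what the portal steps do. The function names vm_scope and print_role_assignment are hypothetical, and the second function prints the az command instead of running it.

```shell
# Build the Azure resource ID of a virtual machine, used as the assignment scope.
# $1 = subscription ID, $2 = resource group, $3 = VM name
vm_scope() {
    printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachines/%s\n' \
        "$1" "$2" "$3"
}

# Print (not run) an az command that assigns the custom role on one VM.
# $1 = application ID of the service principal, $2 = scope from vm_scope
print_role_assignment() {
    printf 'az role assignment create --assignee %s --role "Linux Fence Agent Role" --scope %s\n' \
        "$1" "$2"
}
```

Call both functions once per cluster node, review the printed command, and run it only if it matches your environment.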

 3. Stop the Db2 instance on both hosts: 
     db2stop force 

4. Log in as root, which is required for steps 5 through 9.

5. Stop the Pacemaker cluster on both hosts: 
    crm cluster stop 

6. Edit the /etc/corosync/corosync.conf file on both hosts.

A. Add the "wait_for_all: 0" clause
quorum { 
     provider: corosync_votequorum 
     two_node: 1
     wait_for_all: 0 
} 
B. Increase the value for token from 10000 to 30000 (milliseconds)
totem {
    version: 2
    cluster_name: pa2dom
    transport: knet
    token: 30000
    crypto_cipher: aes256
    crypto_hash: sha256
}
Note: For details about the timeout value for missed tokens, refer to this Documentation: Maintenance and updates - Azure virtual machines | Microsoft Docs
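
Because the file must be edited identically on both hosts, a quick check after editing can catch a missed change. The following is a minimal sketch, assuming each setting sits on its own line as in the excerpts above. The function name check_corosync_conf is hypothetical.

```shell
# Check that a corosync.conf file carries the two settings from step 6.
# $1 = path to the file, for example /etc/corosync/corosync.conf
check_corosync_conf() {
    conf="$1"
    grep -Eq '^[[:space:]]*wait_for_all:[[:space:]]*0[[:space:]]*$' "$conf" ||
        { echo "wait_for_all: 0 missing in $conf" >&2; return 1; }
    grep -Eq '^[[:space:]]*token:[[:space:]]*30000[[:space:]]*$' "$conf" ||
        { echo "token: 30000 missing in $conf" >&2; return 1; }
}
```

Run check_corosync_conf /etc/corosync/corosync.conf on each host before starting the cluster in the next step.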

7. Start the Pacemaker cluster on both hosts: 
     crm cluster start 

8. Monitor the “crm status” output. Once both hosts report online, start the Db2 instance on both hosts and re-activate all HADR databases.

9. Enable the stonith-enabled property: 
  crm configure property stonith-enabled=true 

 

Note: The following error is shown and indicates a configuration mismatch. You can ignore this error because the fencing agent is created in a later step.

  ERROR: (unpack_resources)    error: Resource start-up disabled since no STONITH resources have been defined
         (unpack_resources)   error: Either configure some or disable STONITH with the stonith-enabled option
         (unpack_resources)   error: NOTE: Clusters with shared data need STONITH to ensure data integrity
  Errors found during check: config not valid

10. Set no-quorum-policy to stop.

crm configure property no-quorum-policy=stop

11. Configure the fencing agent in the cluster by running the following two commands: 

crm configure primitive rsc_st_azure stonith:fence_azure_arm params  subscriptionId="12345678-9abc-def1-2345-6789abcdef12" resourceGroup="DB2-NEW-PACEMAKER" tenantId="abcdef1234-5678-9abc-def1-23456789abcd" login="12345678-1234-1234-1233-123456789abc" passwd="VO8IdTp7REE9_A--QWrWNyO_IAvk9vKYu7" pcmk_monitor_retries=4 pcmk_action_limit=3 power_timeout=240 pcmk_reboot_timeout=300 op monitor interval=3600 timeout=120 pcmk_host_map="azr-rd01:10.79.224.7;azr-rd02:10.79.224.8" meta target-role=Started
crm resource manage rsc_st_azure

Note 1: Replace the values for subscriptionId, resourceGroup, tenantId, login, and passwd with the appropriate values from your Azure account. The same applies to the azr-rd01 and azr-rd02 values above, which should be replaced with your actual hostnames.

Note 2: The pcmk_host_map option is ONLY required in the command if the hostnames and the Azure VM names are NOT identical. Specify the mapping in the format hostname:vm-name, with entries separated by semicolons, as in the pcmk_host_map parameter of the command above.
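
The value for pcmk_host_map described in Note 2 can be assembled from hostname:vm-name pairs. The following is a minimal sketch. The function name build_host_map and the VM names azr-vm01 and azr-vm02 used in the example are hypothetical.

```shell
# Join "hostname:vm-name" pairs with ';' in the form used for pcmk_host_map.
build_host_map() {
    map=""
    for pair in "$@"; do
        case "$pair" in
            *:*) ;;
            *) echo "expected hostname:vm-name, got: $pair" >&2; return 1 ;;
        esac
        map="${map:+$map;}$pair"
    done
    printf '%s\n' "$map"
}
```

For example, build_host_map azr-rd01:azr-vm01 azr-rd02:azr-vm02 prints azr-rd01:azr-vm01;azr-rd02:azr-vm02, which can then be quoted as the pcmk_host_map value.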

12. Monitor the “crm status” output and verify that the fencing agent is started. The result should be similar to the following example.

  * rsc_st_azure        (stonith:fence_azure_arm):       Started azr-rd02
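
The check in step 12 can be automated by scanning the status text for the resource line. The following is a minimal sketch, assuming the “crm status” output contains a line in the format shown above. The function name fence_started is hypothetical, and the function does not distinguish managed from unmanaged resources.

```shell
# Read "crm status" text on stdin; succeed if the named resource is reported as Started.
# $1 = resource name, for example rsc_st_azure
fence_started() {
    grep -Eq "(^|[[:space:]])$1[[:space:]].*Started"
}
```

A typical call is crm status | fence_started rsc_st_azure, run as root on one of the hosts.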

13. Restart your Db2 instance by issuing the db2start command on both hosts. Furthermore, ensure that the HADR_PEER_WINDOW for all your automated HADR databases is set to at least 300 seconds.


14. Re-activate the HADR databases to re-enable automation for your HADR databases.

The cluster is now enabled to use the Azure fencing agent. We recommend performing a series of tests to validate that the setup works as planned.
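
The HADR_PEER_WINDOW requirement from step 13 has to be applied to every automated HADR database. The following is a minimal sketch that prints the corresponding Db2 configuration commands. The function name peer_window_cmds and the database name SAMPLE used in the example are hypothetical, and it is assumed that the printed commands are run as the instance owner on both hosts.

```shell
# Print the Db2 command that sets HADR_PEER_WINDOW for each database name given.
# $1 = peer window in seconds, remaining arguments = database names
peer_window_cmds() {
    window="$1"; shift
    if [ "$window" -lt 300 ]; then
        echo "HADR_PEER_WINDOW must be at least 300 seconds" >&2
        return 1
    fi
    for db in "$@"; do
        printf 'db2 update db cfg for %s using HADR_PEER_WINDOW %s\n' "$db" "$window"
    done
}
```

For example, peer_window_cmds 300 SAMPLE prints one db2 update db cfg command for the SAMPLE database. A value below 300 is rejected.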

Managing, Starting and Stopping fencing resources in the cluster

With the db2cm utility, you can manage the cluster and enable or disable the cluster. However, the db2cm utility only manages resources that were created by db2cm. Because the Azure fencing agent is not created using db2cm, you must manage the fencing agent manually.

Before using db2cm to disable the cluster, you must therefore set the Azure fencing agent to the unmanaged state separately. To do so, unmanage the resource and check its status by using the following commands:

crm resource unmanage rsc_st_azure

crm resource status

As a result, you will see the status for the fencing agent similar to the following:

  * rsc_st_azure        (stonith:fence_azure_arm):       Started azr-rd02 (unmanaged)

After the fencing agent is in status “unmanaged”, you can use the db2cm utility to disable the cluster:

   /db2/db2pa2/sqllib/bin/db2cm -disable -all

The same is true when you enable a cluster with the db2cm utility: the Azure fencing agent may need to be enabled separately. To do so, check the status of the cluster after it has been started with “db2cm -enable -all”, then set the resource to managed and start it if required by using the following commands:

   crm resource status rsc_st_azure

   crm resource manage rsc_st_azure

   crm resource start rsc_st_azure

Remove fencing resources from the cluster

With the db2cm utility, you can add and remove resources in the cluster or remove the cluster completely. However, the db2cm utility only removes resources that were created by db2cm. Because the Azure fencing agent is not created using db2cm, you must remove the fencing agent manually if you want to operate the cluster without Azure fencing agents, and also prior to deleting the cluster entirely.

To do so, perform the following steps:

crm configure property stonith-enabled=false

crm configure delete rsc_st_azure --force

crm resource refresh

If you want to delete the entire cluster, you can use db2cm -delete -cluster

If you permanently remove the fencing agent from the cluster, remove the custom role "Linux Fence Agent Role" from the Service Principal on both nodes in the cluster.

  • Go to https://portal.azure.com
  • Open All resources in the menu on the left.
  • Select the virtual machine of the first cluster node and click Access control (IAM)
  • Click “Role assignment” and remove the application for fencing, for example “PCMK1”, from the "Linux Fence Agent Role"

Repeat the steps above for the second cluster node.

Document Location

Worldwide

Product: Db2 for Linux, UNIX and Windows. Category: High Availability. Platform: Platform Independent. Version: 11.5.6 and future releases.

Document Information

Modified date:
10 November 2022

UID

ibm16465977