Red Hat Enterprise Linux High Availability Add-on

Follow this step-by-step procedure to set up a working Red Hat Enterprise Linux High Availability Add-On (Red Hat HA) cluster.

These steps are described in detail in the following sections.

  • Install Red Hat HA

  • Configure Quorum

  • Configure Fencing

  • Add Fence levels

  • Add highly available shared storage

  • Install bastion

Install Red Hat HA

  1. The following steps are performed on all OpenShift LPARs:

    Register the system and install Red Hat HA.

    subscription-manager register --auto-attach
    dnf config-manager --set-enabled rhel-8-for-s390x-highavailability-rpms
    yum update -y
    yum install -y pcs pacemaker fence-agents-all
  2. Add firewall exceptions:

    firewall-cmd --permanent --add-service=high-availability
    firewall-cmd --reload
  3. Set the password for the hacluster Linux user that is used by the cluster (the user is created during package installation):

    passwd hacluster
    Note: Using the same password on each node is recommended.
  4. Enable and start the cluster controlling and configuration daemon:

    systemctl enable pcsd.service --now
    systemctl status pcsd.service
  5. The following steps are performed on one OpenShift LPAR.

    Authenticate the nodes:

    pcs host auth halp01 halp02 halp03
    #Username: hacluster
    #Password: ...
  6. Set up the cluster (replace my_cluster with a cluster name of your choice):

    pcs cluster setup my_cluster halp01 halp02 halp03
  7. Start and enable the cluster service:

    pcs cluster start --all
    pcs cluster enable --all

The pcs CLI tool allows you to configure the cluster (Pacemaker + Corosync) and view the status.

Full cluster status:

pcs status --full
# Pacemaker xml configuration
pcs cluster cib

Quorum configuration

The following steps are performed on one OpenShift LPAR:

  1. With wait_for_all enabled, the cluster only becomes quorate (and therefore functional) for the first time once all cluster members are available (a verification sketch follows at the end of this section):

    pcs quorum update wait_for_all=1
    Note: For example, if you start three LPARs consecutively without enabling wait_for_all, the last LPAR might be fenced by the two LPARs that are already available.
  2. The totem token timeout specifies the time in milliseconds until a token loss is reported:

    pcs cluster config update totem token=5000
    Note: For totem token limits, check out the Corosync support policies.

    Check that the totem token timeout is activated:

    corosync-cmapctl | grep "runtime.*totem.token "
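
To confirm that both quorum changes are active, you can also read back the quorum configuration (a quick sketch; the output should show wait_for_all: 1):

pcs quorum config
pcs quorum status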

Fencing

For fencing, the following three levels are implemented:

  • fence_ibmz (Level 2): The main fencing method is power fencing through the HMC. It provides solid fencing because the power-off is triggered externally via the HMC API.

  • fence_sbd (Level 3): When the HMC is not available, SBD is used as the backup fence agent. As a last resort, self-fencing is a reliable backup option that might take a bit longer but takes effect even in the worst cases. The poison pill is used to speed up this fencing method in some failure cases.

  • fence_kdump (Level 1): Included for debugging purposes. When you take a kdump (either automatically or manually), you want to prevent the other fencing methods from triggering. This way, fencing is considered successful when an LPAR kdumps.

fence_ibmz

HMC prerequisites before setting up fence_ibmz

  1. For the fence_ibmz fence agent, an HMC user is needed with the following rights: Deactivate, Activate, Load, and View Activation Profiles. Those rights are listed under Summary -> Tasks:

    Figure 1. Summary for the user4fencing user on the HMC.

    At the bottom, Objects shows that the user is scoped to the cluster members (halp01, halp02 and halp03) only.

  2. Allow access to the Web Services management interface:

    Figure 2. Allow access to Web Services management interfaces.
    Mark the checkbox next to Allow access to the Web Services management interfaces.

    Keep in mind how the timeouts are configured. The default values are usually very high, which should not affect fencing actions.

    Figure 3. Session timeouts
    Keep Session timeout, Verify timeout and Idle timeout in mind.
  3. In the activation profile of each LPAR, you might want to set the option Load during activation. This has the advantage of skipping the additional load task that is otherwise triggered by the fence agent. This option is required when you use SCSI disks (instead of DASDs); with DASDs it is optional.

    Figure 4. Load during activation
    Mark the checkbox next to Load during activation in the activation profile.
  4. Perform the following steps on all OpenShift LPARs:

    To reach the HMC in a secure way from the LPARs, the TLS root certificate and any intermediate CA certificates must be trusted.

    The file CA_CERT.pem contains the root certificate and all intermediate CA certificates used in the chain of the HMC server certificate. You can view the chain with most web browsers. The format of the file is PEM.

    Copy the CA certificate file to the appropriate location and update the certificate authority with the associated trust:

    cp CA_CERT.pem /etc/pki/ca-trust/source/anchors/
    update-ca-trust
  5. Verify that the certificates are in the trust store:

    trust list | less

    Verify that the complete certificate chain to the HMC is trusted:

    openssl s_client -showcerts -connect ${HMC_URL}:443 -verify_return_error < /dev/null
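
Optionally, you can also check that the HMC Web Services API used by fence_ibmz is reachable from the LPAR. The following is only a sketch: it assumes the default Web Services API port 6794 and uses the version query, which does not require a logon:

# assumes the default HMC Web Services API port 6794
curl https://${HMC_URL}:6794/api/version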

Install fence_ibmz

Perform the following steps on all OpenShift LPARs:

  1. fence_ibmz is installed from the upstream repository because, at the time of writing, the fence_ibmz package is not available via the package manager:
    Note: In newer RHEL versions, fence_ibmz might already be installed with the fence-agents-all package. In this case, skip to step 4.
    dnf install -y wget
    wget https://raw.githubusercontent.com/ClusterLabs/fence-agents/master/agents/ibmz/fence_ibmz.py
  2. Replace platform-specific variables in the fence_ibmz code:

    sed -i 's+@PYTHON@+/usr/libexec/platform-python+' fence_ibmz.py
    sed -i 's+@FENCEAGENTSLIBDIR@+/usr/share/fence+' fence_ibmz.py
  3. Copy fence_ibmz.py to a common location for fence agents and make the script executable:

    cp fence_ibmz.py /usr/sbin/fence_ibmz
    chmod +x /usr/sbin/fence_ibmz
  4. Verify that fence_ibmz is visible and installed:

    pcs stonith list|grep fence_ibmz
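
To review the parameters that the installed agent accepts (for example ip, username, pcmk_host_map) before configuring it, you can display its description:

pcs stonith describe fence_ibmz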

Add fence_ibmz to the cluster

Perform on one OpenShift LPAR:

  1. Add the fence agent to the cluster:

    
    pcs stonith create fence_ibmz fence_ibmz \
                ip=${HMC_URL} \
                username="${HMC_USER}" \
                password="${HMC_USER_PASSWORD}" \
                ssl_secure=true \
                pcmk_host_map="halp01:ZZ1/HALP01;halp02:ZZ1/HALP02;halp03:ZZ1/HALP03"
    Note: In pcmk_host_map, the value after each colon (the CPC/LPAR name as defined on the HMC) is case-sensitive! It is possible to hide the password by providing a password_script as described here (a sketch follows after this list).
  2. If you enabled the Load during activation option in the activation profile of each LPAR on the HMC (see the previous step), set the load_on_activate attribute as well:

    pcs stonith update fence_ibmz load_on_activate=true
  3. Add debug log output for verification:

    pcs stonith update fence_ibmz verbose=1 debug_file=/tmp/fence_ibmz.log

    Trigger fencing manually to verify the fence agent:

    pcs stonith fence halp02
    cat /tmp/fence_ibmz.log | less
    Note: stonith-timeout and stonith-action options might be ignored when triggering manual fencing as described here.
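
If you want to avoid storing the HMC password directly in the cluster configuration, a password_script can be used instead, as mentioned in the note above. The following is only a sketch with hypothetical paths: it assumes the password is kept in a root-only file and that the helper script simply prints it on stdout.

# hypothetical helper script and password file - adjust paths to your environment
cat > /usr/local/bin/hmc-password.sh << 'EOF'
#!/bin/sh
# print the HMC password stored in a root-only file
cat /etc/pacemaker/hmc-password
EOF
chmod 700 /usr/local/bin/hmc-password.sh

# use the script instead of the plain-text password attribute
# (an empty value removes the previously set password attribute)
pcs stonith update fence_ibmz password_script=/usr/local/bin/hmc-password.sh password=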

Further considerations (recommended)

The stonith-timeout property defines how long to wait until the STONITH action (for example, on or off) is completed; the default is 60 seconds. It can be overridden with pcmk_<action>_timeout on a per-fence-agent basis.

Because the deactivation process on the HMC can take up to 900 seconds (default), you can override the STONITH action timeout for the fence agent.

In the following code snippet, the timeout for each operation (off and on) is set to 905 seconds (perform on one OpenShift LPAR):

pcs stonith update fence_ibmz \
                   pcmk_reboot_timeout=1810 \
                   pcmk_off_timeout=905 \
                   pcmk_on_timeout=905
Note: pcmk_reboot_timeout is not relevant here, because the fence operation internally maps the reboot action to off and on.
Note: Those timeouts may become relevant when the fencing target (LPAR) runs at full capacity or is still sporadically responsive. In those cases the HMC shutdown task can take significantly longer.
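
After updating the timeouts, you can review the resulting fence agent configuration and the cluster-wide stonith-timeout default (a quick sketch):

pcs stonith config fence_ibmz
pcs property list --defaults | grep stonith-timeout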

fence_sbd

fence_sbd prerequisite - watchdog

Perform on all OpenShift LPARs:

  1. SBD uses a watchdog to monitor the system. For IBM zSystems, the diag288_wdt hardware watchdog is preferred. To enable the watchdog, load the diag288_wdt kernel module:

    modprobe diag288_wdt
  2. To verify that the watchdog is loaded, use the following command:

    wdctl
  3. To load the kernel module across reboots, run:

    echo "diag288_wdt" > /etc/modules-load.d/watchdog.conf
    Note: The watchdog timeout cannot be lower than 15 seconds.
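
A quick way to double-check the watchdog setup on each LPAR (a sketch): the module should be loaded and the watchdog device node should exist.

lsmod | grep diag288_wdt
ls -l /dev/watchdog*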

fence_sbd prerequisite - shared disk for poison pill

Perform on one OpenShift LPAR:

  1. For the poison pill communication, a shared disk is needed. In the following, a DASD disk is used.

    Attach the DASD, format it, and create a partition (replace the device number 0.0.0200 with the appropriate DASD of your environment):

    # attach/enable dasd
    chzdev -e dasd 0.0.0200
    # format dasd
    dasdfmt -b 4096 -d cdl -p /dev/disk/by-path/ccw-0.0.0200
    # creates a partition over the entire disk automatically
    fdasd -a /dev/disk/by-path/ccw-0.0.0200
  2. For SBD to work, a header is written to the previously created partition:

    pcs stonith sbd device setup \
                --device=/dev/disk/by-path/ccw-0.0.0200-part1 \
                watchdog-timeout=15 \
                msgwait-timeout=30
  3. Verify the status of SBD by viewing the full status:

    pcs stonith sbd status --full
  4. Enable the SBD daemon:

    pcs stonith sbd enable \
    watchdog=/dev/watchdog \
    device=/dev/disk/by-path/ccw-0.0.0200-part1 \
    SBD_DELAY_START=60 SBD_WATCHDOG_TIMEOUT=15

    Notes:

    • SBD_* are environment variables for the SBD systemd service.

    • SBD_WATCHDOG_TIMEOUT only applies when SBD runs in diskless mode. When disks are defined, the watchdog timeout written to the disk header is used.

    • The diag288 watchdog minimum timeout is 15 seconds (see the note in the watchdog section above).

    • SBD_DELAY_START postpones the start of the Pacemaker systemd service.

    • SBD_DELAY_START should be longer than: corosync token timeout (5) + consensus timeout (6) + pcmk_delay_max (0) + msgwait (30) = 41 seconds. Otherwise, Pacemaker might start with exit code 100 (a verification sketch follows at the end of this section).

  5. To make the changes active, restart the cluster:

    pcs cluster stop --all
    pcs cluster start --all
  6. Show the default power_timeout, which indicates how long the fencing process waits before Pacemaker must be up again after fencing:

    fence_sbd -o metadata|grep -A 2 power_timeout
  7. Create the SBD fence agent and set the power_timeout:

    pcs stonith create fence_sbd fence_sbd \
    devices="/dev/disk/by-path/ccw-0.0.0200-part1" \
    power_timeout=45
  8. Show and verify the settings of the SBD fence agent:

    pcs stonith config
  9. Make sure to test SBD fencing before continuing:

    1. Send messages:

      sbd -d /dev/disk/by-path/ccw-0.0.0200-part1 message halp01 test
    2. Look at the log of the SBD systemd service:

      journalctl -u sbd -f
    3. Send poison pills:

      # make sure to disable other fencing methods first:
      pcs stonith disable fence_ibmz
      # test
      time pcs stonith fence halp01
      pcs stonith enable fence_ibmz

      The target system should reboot and the Pacemaker systemd service should be delayed by the SBD systemd service by msgwait-timeout seconds.

    4. Further debugging options:

      • Increase the SBD verbosity by adding a -v to the SBD_OPTS in /etc/sysconfig/sbd.

      • Look at systemd startup

        systemd-analyze critical-chain
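
To double-check the values that go into the SBD_DELAY_START calculation above, you can read back the corosync timeouts and the timeouts stored in the SBD disk header (a sketch, using the same shared device as above):

# corosync token and consensus timeouts (milliseconds)
corosync-cmapctl | grep -E "totem\.(token|consensus)"
# watchdog and msgwait timeouts stored on the SBD device
sbd -d /dev/disk/by-path/ccw-0.0.0200-part1 dump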

fence_kdump

Perform on all OpenShift LPARs:

  1. Enable and start the kdump systemd service:

    systemctl enable kdump --now
  2. Add the required firewall rule so that fence_kdump can receive the messages sent by fence_kdump_send:

    firewall-cmd --add-port=7410/udp --permanent
    firewall-cmd --reload
  3. Edit the kdump configuration /etc/kdump.conf:

    1. Set the port to send the kdump message to (see man fence_kdump_send) and set the interval to every 10 seconds (repeating forever):

      fence_kdump_args -p 7410 -f auto -c 0 -i 10
    2. List all hostnames (cluster members) to which the kdump message is sent:

      fence_kdump_nodes halp01 halp02 halp03

      The system that runs fence_kdump at that time receives the message.

  4. Restart the kdump service:

    systemctl restart kdump
  5. Add the kdump fence agent to the cluster:

    pcs stonith create kdump fence_kdump \
                pcmk_reboot_action="off" \
                pcmk_host_list="halp01 halp02 halp03" \
                verbose=1
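
Before relying on kdump fencing, it can help to confirm that kdump itself is operational on each LPAR and that the new stonith resource is running (a quick sketch):

kdumpctl status
pcs stonith status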

Add Fence Levels

Perform on one OpenShift LPAR:

  1. Use fence levels to set the order in which the fence agents are run. Fence levels are tried from the lowest to the highest level. Adjust the regular expression so that it matches the LPAR names of your environment (here halp01, halp02, halp03):

    pcs stonith level add 1 regexp%halp[0-9]+ fence_kdump
    pcs stonith level add 2 regexp%halp[0-9]+ fence_ibmz
    pcs stonith level add 3 regexp%halp[0-9]+ fence_sbd
  2. Show and verify the fencing levels:

    pcs stonith level
    pcs stonith level verify

    If you made a mistake you can remove entire levels like this:

    pcs stonith level remove 1
  3. Additional configurations

    1. When your fencing is misconfigured, or the node to be fenced still has healthy cluster communication (for example, if you are using fabric fencing), the node is notified of its own fencing. In this case, the fence-reaction property decides what happens; panic is the safest choice and reboots the node.

      pcs property set fence-reaction=panic
    2. To prevent multiple fencing operations in parallel, you can disable concurrent-fencing. In a 3-node cluster (that can only withstand the failure of one node), you might not need concurrent-fencing:

      pcs property set concurrent-fencing=false
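
To confirm that both properties took effect, you can list the cluster properties that are explicitly set (a sketch):

pcs property list | grep -E "fence-reaction|concurrent-fencing"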

Add highly available shared storage

Previously, you added the shared SCSI disks to the systems and created a partition on each of them. In this topic, the partitions are mounted to one OpenShift LPAR in an active/passive fashion.

Perform on one OpenShift LPAR:

  1. Before you can add the shared SCSI LUNs to the cluster, a logical volume is needed for each SCSI LUN. Repeat the following steps for each SCSI LUN that is used by a KVM guest (replace the WWID with the appropriate WWID of your environment):

    pvcreate /dev/disk/by-id/scsi-3600507630bffc320000000000000e027-part1
    vgcreate bastion-vg /dev/disk/by-id/scsi-3600507630bffc320000000000000e027-part1
    lvcreate -n bastion-lv -l 100%FREE bastion-vg
    mkfs.ext4 /dev/bastion-vg/bastion-lv

    Repeat the above process for control[0-2], compute[0-1], and bootstrap (a consolidated loop sketch follows at the end of this section).

    Note: It is not required to make the bootstrap node LUN highly available because you only need it during installation. However, one advantage of adding it is the possibility to repurpose the bootstrap node as a compute node after installation.
  2. Create all mounting directories:

    mkdir -p /mnt/control0-mnt/images
    mkdir -p /mnt/control1-mnt/images
    mkdir -p /mnt/control2-mnt/images
    mkdir -p /mnt/compute0-mnt/images
    mkdir -p /mnt/compute1-mnt/images
    mkdir -p /mnt/bootstrap-mnt/images
    mkdir -p /mnt/bastion-mnt/images
  3. Create the required Red Hat HA resources for each SCSI LUN. Perform the following steps for bastion, control 0-2, compute 0-1 and bootstrap (using the appropriate values of your environment):

    pcs resource create bastion-lvm LVM-activate vgname=bastion-vg vg_access_mode=system_id --group bastion-group
    pcs resource create bastion-fs Filesystem device="/dev/bastion-vg/bastion-lv" directory="/mnt/bastion-mnt" fstype="ext4" --group bastion-group
  4. Make sure the resource has no errors and is mounted on the LPAR you are currently working on. If the resource group is currently running on a different LPAR, use the following command to move it:

    pcs resource move group-name destination-lpar
  5. Make sure libvirt has sufficient permissions for the new directories; this also requires managing SELinux. The man page suggests the svirt_home_t SELinux type for custom directories, but the virt_image_t type also works correctly. The following command creates a regex rule that labels each file under the used mount directories with virt_image_t. The restorecon command then relabels the files and directories recursively:

    semanage fcontext --add -t virt_image_t -f f '/mnt(/[^/]*-mnt)(/.*)?'
    restorecon -R -v '/mnt'
    # if already customized use force (-F):
    # restorecon -F -R -v '/mnt'
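
The following loop is a consolidated sketch of step 1 for the remaining guests. The WWIDs in the map are placeholders (hypothetical); replace every value with the real LUN of your environment.

# hypothetical WWID map - replace every value with the real WWID of the LUN
declare -A luns=(
  [control0]=scsi-3600507630bffc320000000000000e1aa
  [control1]=scsi-3600507630bffc320000000000000e1ab
  [control2]=scsi-3600507630bffc320000000000000e1ac
  [compute0]=scsi-3600507630bffc320000000000000e1ad
  [compute1]=scsi-3600507630bffc320000000000000e1ae
  [bootstrap]=scsi-3600507630bffc320000000000000e1af
)
for guest in "${!luns[@]}"; do
  dev="/dev/disk/by-id/${luns[$guest]}-part1"
  pvcreate "$dev"
  vgcreate "${guest}-vg" "$dev"
  lvcreate -n "${guest}-lv" -l 100%FREE "${guest}-vg"
  mkfs.ext4 "/dev/${guest}-vg/${guest}-lv"
done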

Install bastion

Prepare bastion KVM guest

Perform on one OpenShift LPAR where all the storage from above is currently mounted:

  1. Create the qcow image required for the bastion installation under the mounted highly available shared SCSI LUN:

    qemu-img create -f qcow2 /mnt/bastion-mnt/images/bastion-disk.qcow2 100G
  2. Install RHEL 8.5 on the bastion disk (replace values with appropriate values of your environment):

    virt-install --name bastion \
    --memory 8192 \
    --vcpus 4 \
    --disk /mnt/bastion-mnt/images/bastion-disk.qcow2 \
    --location http://redhat-dvd-hosting-server.corp/redhat/s390x/RHEL8.5-latest/DVD/ \
    --os-variant "rhel8.5" \
    --network "network=default,mac=02:9B:16:BB:BB:BB" \
    --network "network=macvtap-10-bridge,mac=02:9B:17:BB:BB:BB" \
    --initrd-inject "/root/bastion.ks" \
    --extra-args "inst.ks=file:///bastion.ks" \
    --noautoconsole
  3. Shut down the bastion guest and dump the XML guest definition to the highly available SCSI disk:

    virsh shutdown bastion
    
    virsh dumpxml bastion > /mnt/bastion-mnt/bastion-guest.xml
    
    virsh destroy bastion
    virsh undefine bastion
  4. To make sure the SELinux labels are correct, relabel everything under /mnt:

    restorecon -R -v '/mnt'
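
Two quick checks before handing the guest over to the cluster (a sketch): the dumped definition should reference the qcow2 image on the shared disk, and the image should carry the virt_image_t label.

grep "bastion-disk.qcow2" /mnt/bastion-mnt/bastion-guest.xml
ls -Z /mnt/bastion-mnt/images/bastion-disk.qcow2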

Adding the KVM guest as cluster resource

  1. Before adding the KVM guest as resource to the cluster, make sure all OpenShift LPARs can reach each other via SSH, for example (perform on all OpenShift LPARs):

    ssh-keygen
    ssh-copy-id root@halp01
    ssh-copy-id root@halp02
    ssh-copy-id root@halp03
    Note: This is a prerequisite for the VirtualDomain resource option migration_transport=ssh (see the next step). This option means that when a KVM guest is moved to a different LPAR, SSH is used for communication.
  2. Add the KVM guest as VirtualDomain resource to the high availability cluster by performing the following step on the OpenShift LPAR that has all the storage from above attached:

    pcs resource create bastion-guest \
        ocf:heartbeat:VirtualDomain \
        config="/mnt/bastion-mnt/bastion-guest.xml" \
        hypervisor="qemu:///system" \
        migration_transport="ssh" \
        meta allow-migrate=false \
        --group bastion-group
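
To verify that the guest is managed by the cluster and fails over cleanly, you can check the resource status and move the whole group to another LPAR (a sketch; halp02 is just an example target):

pcs resource status
# move the whole group (storage + guest) to another LPAR
pcs resource move bastion-group halp02
# on the target LPAR, the bastion guest should now be running
virsh list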

Congratulations, you just made your first KVM guest highly available.