Minimizing downtime on Azure Red Hat OpenShift clusters that are deployed across multiple availability zones

If you installed IBM Software Hub on an Azure Red Hat OpenShift (ARO) cluster that is deployed across multiple availability zones, you can use the scheduling service node scoring configuration to minimize downtime when a zone failure occurs.

Who needs to complete this task?

A cluster administrator must complete this task.

When do you need to complete this task?

This task applies only if you installed IBM Software Hub on an ARO cluster that is deployed across multiple availability zones.

Before you begin

To use node scoring, the scheduling service must be installed.

To check whether the scheduling service is installed, run the following command:
oc get scheduling -A
  • If the scheduling service is installed, the command returns information about the project where the scheduling service is installed and the version that is installed.
  • If the scheduling service is not installed, the command returns an empty response.

    To install the scheduling service, see Installing shared cluster components for IBM Software Hub.
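
If you want to script this check, a minimal sketch (assumes the oc CLI and an active cluster login):

# Print a message if no Scheduling resources exist in any project
if [ -z "$(oc get scheduling -A -o name 2>/dev/null)" ]; then
  echo "Scheduling service is not installed"
fi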

Best practice: You can run many of the commands in this task exactly as written if you set up environment variables for your installation. For instructions, see Setting up installation environment variables.

Ensure that you source the environment variables before you run the commands in this task.
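
For reference, this task uses two variables. The following sketch uses placeholder values; take the real values from your environment variables file:

export PROJECT_SCHEDULING_SERVICE=ibm-cpd-scheduler    # Project where the scheduling service is installed; placeholder value
export VERSION=5.1.0                                   # IBM Software Hub release; placeholder value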

About this task

To minimize downtime when a zone failure occurs, use the scheduling service to configure pod topology spread constraints. Pod topology spread constraints control how pods are scheduled across topology domains, such as nodes or zones.

Update the PodTopologySpread plug-in to control how pods are distributed across nodes and zones:

Distributing pods across nodes
Use the kubernetes.io/hostname label key to minimize the difference in the number of pods across nodes based on their hostname.

Set the maxSkew parameter to 1 to indicate that the number of pods on any node should not differ by more than 1 from any other node.

Set the whenUnsatisfiable parameter to ScheduleAnyway to indicate that the scheduling service should still schedule the pod even if it cannot satisfy the spread constraint. However, the scheduling service will prioritize nodes that minimize the skew.

Distributing pods across zones
Use the topology.kubernetes.io/zone label key to minimize the difference in the number of pods across availability zones.

Set the maxSkew parameter to 1 to indicate that the number of pods in any availability zone should not differ by more than 1 from any other availability zone. For example, if three zones have 3, 2, and 2 matching pods, the skew is 1 and the constraint is satisfied; placing the next pod in the first zone would raise the skew to 2, so the scheduling service prefers the other zones.

Set the whenUnsatisfiable parameter to ScheduleAnyway to indicate that the scheduling service should still schedule the pod even if it cannot satisfy the spread constraint. However, the scheduling service will prioritize nodes that minimize the skew.

The pods associated with an instance of IBM Software Hub will be scheduled when both of the preceding constraints are satisfied.

If one or both constraints cannot be satisfied, the scheduling service will prioritize nodes that minimize the skew across nodes and availability zones.
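
For context, the entries that you will add to defaultConstraints use the same fields as the standard Kubernetes topologySpreadConstraints stanza that can be set on an individual pod. The following sketch shows the zone constraint expressed on a hypothetical pod (all names and the image are placeholders); with the scheduler-wide defaults that you configure in this task, individual pods do not need this stanza:

apiVersion: v1
kind: Pod
metadata:
  name: example
  labels:
    app: example
spec:
  topologySpreadConstraints:
    - maxSkew: 1                                  # Zone pod counts may differ by at most 1
      topologyKey: topology.kubernetes.io/zone    # Spread across availability zones
      whenUnsatisfiable: ScheduleAnyway           # Prefer, but do not require, even spread
      labelSelector:
        matchLabels:
          app: example
  containers:
    - name: example
      image: registry.example.com/example:latest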

Procedure

  1. Check the current scaling configuration of the scheduling service:
    cpd-cli manage get-cluster-component-scale-config \
    --cluster_component_ns=${PROJECT_SCHEDULING_SERVICE} \
    --component=scheduler \
    --release=${VERSION}
    • If the command returns level_3 (medium), level_4 (large), or level_5 (large), proceed to the next step.
    • If the command returns level_1 (small_mincpureq) or level_2 (small), scale the scheduling service to level_3:
      cpd-cli manage apply-cluster-component-scale-config \
      --cluster_component_ns=${PROJECT_SCHEDULING_SERVICE} \
      --component=scheduler \
      --release=${VERSION} \
      --scale=level_3
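      After you apply the new scale, you can rerun the check from the beginning of this step to confirm that the scheduling service now reports level_3:
      cpd-cli manage get-cluster-component-scale-config \
      --cluster_component_ns=${PROJECT_SCHEDULING_SERVICE} \
      --component=scheduler \
      --release=${VERSION}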
  2. Print the contents of the ibm-cpd-scheduler-schedulerplugins ConfigMap to a file called topology-cm.yaml:
    oc get cm ibm-cpd-scheduler-schedulerplugins \
    -n ${PROJECT_SCHEDULING_SERVICE} \
    -o yaml > topology-cm.yaml
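    Optionally, confirm that the export succeeded and locate the section that you will edit in the next steps:
    grep -n "pluginConfig" topology-cm.yaml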
  3. Open the topology-cm.yaml file in a text editor.
  4. In the profiles section, locate the pluginConfig section:
     profiles:
        - schedulerName: ibm-cpd-scheduler
          pluginConfig:

    If the section does not exist, add it.

  5. Add the following configuration to the pluginConfig section:
          - name: PodTopologySpread
            args:
              defaultConstraints: 
                - maxSkew: 1
                  topologyKey: "kubernetes.io/hostname"
                  whenUnsatisfiable: ScheduleAnyway
                - maxSkew: 1
                  topologyKey: "topology.kubernetes.io/zone"
                  whenUnsatisfiable: ScheduleAnyway
              defaultingType: List

    Ensure that you indent the configuration correctly. For example:

    profiles:
        - schedulerName: ibm-cpd-scheduler
          pluginConfig:
          - name: PodTopologySpread
            args:
              defaultConstraints: 
                - maxSkew: 1
                  topologyKey: "kubernetes.io/hostname"
                  whenUnsatisfiable: ScheduleAnyway
                - maxSkew: 1
                  topologyKey: "topology.kubernetes.io/zone"
                  whenUnsatisfiable: ScheduleAnyway
              defaultingType: List
  6. Save your changes to the topology-cm.yaml file.
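    Optionally, verify that the edited file still parses as a valid ConfigMap before you replace the live copy. This client-side dry run makes no changes on the cluster:
    oc create --dry-run=client -f topology-cm.yaml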
  7. Delete the ibm-cpd-scheduler-schedulerplugins ConfigMap:
    oc delete cm ibm-cpd-scheduler-schedulerplugins \
    -n ${PROJECT_SCHEDULING_SERVICE}
  8. Create the ibm-cpd-scheduler-schedulerplugins ConfigMap from the topology-cm.yaml file:
    oc create -f topology-cm.yaml
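    To confirm that the recreated ConfigMap contains the new constraints, you can run:
    oc get cm ibm-cpd-scheduler-schedulerplugins \
    -n ${PROJECT_SCHEDULING_SERVICE} \
    -o yaml | grep -A 1 PodTopologySpread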
  9. Restart the scheduling service pods to pick up the updated ibm-cpd-scheduler-schedulerplugins ConfigMap:
    oc get po -n ${PROJECT_SCHEDULING_SERVICE} | grep ibm-cpd-scheduler-scheduler | while read a b; do
      oc delete pod -n ${PROJECT_SCHEDULING_SERVICE} $a
    done
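    After the old pods are deleted, confirm that the replacement scheduler pods reach the Running state:
    oc get po -n ${PROJECT_SCHEDULING_SERVICE} | grep ibm-cpd-scheduler-scheduler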

Results

Because the pods for IBM Software Hub are spread as evenly as possible across nodes and availability zones, a zone failure affects only a subset of each workload's pods, which minimizes downtime.
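
One way to spot-check the spread is to list the zone label on each node and then see which nodes the pods land on. A sketch, assuming that ${PROJECT_CPD_INST_OPERANDS} is set to the project where your IBM Software Hub instance is installed:

oc get nodes -L topology.kubernetes.io/zone
oc get po -n ${PROJECT_CPD_INST_OPERANDS} -o wide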