Restarting the Cloud Pak for AIOps cluster

Before a planned maintenance window or a known outage, you might need to shut down the Red Hat OpenShift cluster where IBM Cloud Pak for AIOps is installed, and then restart it afterward.

Warning: If you need to shut down the Red Hat® OpenShift® Container Platform cluster where IBM Cloud Pak® for AIOps is installed, then you must use the following procedure. Failure to do so can result in data loss or corruption.

Procedure

  1. Validate the installation
  2. Check the certificates
  3. Prepare to scale down
  4. Scale down the workloads and drain the nodes
  5. Shut down the cluster
  6. Restart the cluster
  7. Scale up the workloads
  8. Validate the installation

1. Validate the installation

  1. Export the namespace where Cloud Pak for AIOps is installed:

    export AIOPS_NAMESPACE=<project>
    
  2. Run the describe command:

    oc describe installations.orchestrator.aiops.ibm.com -n "${AIOPS_NAMESPACE}"
    

    Review the ComponentStatus fields to confirm that all components are marked as Ready and the phase is Running.

    Example output:

    Name:         ibm-cp-aiops
    Namespace:    aiops
    API Version:  orchestrator.aiops.ibm.com/v1alpha1
    Kind:         Installation
    Spec:
    ...
    Status:
    Componentstatus:
       Aimanager:                       Ready
       Aiopsanalyticsorchestrator:      Ready
       Aiopsedge:                       Ready
       Aiopsui:                         Ready
       Asm:                             Ready
       Baseui:                          Ready
       Cluster:                         Ready
       Commonservice:                   Ready
       Elasticsearch:                   Ready
       Flinkcluster:                    Ready
       Issueresolutioncore:             Ready
       Kafka:                           Ready
       Lifecycleservice:                Ready
       Lifecycletrigger:                Ready
       Rediscp:                         Ready
       Tunnel:                          Ready
       Zenservice:                      Ready
    Phase:                   Running
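
    If you prefer a scripted check, the following minimal sketch prints the overall phase, which is expected to be Running. It assumes that exactly one Installation resource exists in the namespace:

    # Print the overall installation phase (expected: Running).
    # Assumes a single Installation resource in the namespace.
    oc get installations.orchestrator.aiops.ibm.com -n "${AIOPS_NAMESPACE}" \
      -o jsonpath='{.items[0].status.phase}{"\n"}'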
    

2. Check the certificates

Ensure that none of the certificates have problems or are expired.

Run the following command:

# List every TLS secret in the cluster with its certificate expiry date.
# The EXPIRY column initially repeats the secret name as a placeholder;
# the loop replaces it with the certificate's notAfter date.
while read -r l; do
  echo "$l" | grep '^NAME' || (
    n=$(echo "$l" | sed 's/ .*//')
    s=$(echo "$l" | sed 's/^[^ ]* *\([^ ]*\).*/\1/')
    x=$(oc get secret -n "$n" "$s" -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -enddate 2>/dev/null | sed 's!notAfter=!!')
    echo "$l" | sed 's![^ ][^ ]*$!'"$x"'!'
  )
done < <(oc get secret -A --field-selector=type==kubernetes.io/tls -o custom-columns=NAMESPACE:.metadata.namespace,NAME:.metadata.name,EXPIRY:.metadata.name)

Example output excerpt:

ibm-licensing   ibm-license-service-cert                                       Jan  8  13:32:07  2025  GMT
ibm-licensing   ibm-license-service-cert-internal                              Jan  7  13:31:12  2026  GMT
ibm-licensing   ibm-licensing-service-prometheus-cert                          Jan  7  13:31:25  2026  GMT
cp4aiops        aimanager-aio-log-anomaly-feedback-learning-cert               Jan  7  14:01:43  2026  GMT
cp4aiops        aimanager-aio-log-anomaly-golden-signals-cert                  Jan  7  14:01:43  2026  GMT
cp4aiops        aimanager-aio-oob-recommended-actions-cert                     Jan  7  14:01:43  2026  GMT
<...>

If any certificates have problems, are expired, or will expire before the cluster is restarted, then see Renew or re-create certificates in OpenShift 4.x in the Red Hat documentation.
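
To flag certificates that expire within a given window, a variation that uses the openssl -checkend option can help. This is a minimal sketch; the 30-day window (2592000 seconds) is an arbitrary example value:

# Flag TLS secrets whose certificate expires within 30 days
while read -r ns name; do
  if ! oc get secret -n "$ns" "$name" -o jsonpath='{.data.tls\.crt}' \
      | base64 -d | openssl x509 -noout -checkend 2592000 >/dev/null 2>&1; then
    echo "Expires within 30 days (or unreadable): $ns/$name"
  fi
done < <(oc get secret -A --field-selector=type==kubernetes.io/tls -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name --no-headers)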

3. Prepare to scale down

  1. Set environment variables.

    Make a note of these environment variables or save them to a file, as you will need to export them again after you restart your cluster.

    export AIOPS_NAMESPACE=<project>
    export AIOPS_INSTANCE=$(oc get installation -o jsonpath='{.items[0].metadata.name}' -n ${AIOPS_NAMESPACE})
    export WORKER_COUNT=<worker_count>
    export CLUSTER_NAME=<cluster_hostname>
    

    Where

    • <project> is the namespace (project) that your IBM Cloud Pak for AIOps installation is deployed in.
    • <worker_count> is the number of worker nodes in your cluster.
    • <cluster_hostname> is the host name of your Red Hat OpenShift Container Platform cluster.
  2. Cordon the worker nodes.

    for i in $(seq 0 $((WORKER_COUNT - 1)))
    do
      oc adm cordon "worker${i}.${CLUSTER_NAME}"
      sleep 2
    done
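
    If your worker nodes do not follow the worker<i>.<cluster_hostname> naming convention, you can cordon every node that carries the worker role label instead. A minimal sketch:

    for node in $(oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{.items[*].metadata.name}')
    do
      oc adm cordon "${node}"
      sleep 2
    done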
    
  3. Make a note of the number of replicas.

    1. Make a note of the number of replicas for each StatefulSet.

      oc get statefulsets -n ${AIOPS_NAMESPACE}
      

      Example output:

      NAME                                               READY   AGE
      aimanager-ibm-minio                                1/1     18m
      aiops-ir-analytics-spark-worker                    2/2     33m
      aiops-ir-core-ncobackup                            1/1     37m
      aiops-ir-core-ncoprimary                           1/1     39m
      aiops-ir-lifecycle-eventprocessor-ep-jobmanager    1/1     17m
      aiops-ir-lifecycle-eventprocessor-ep-taskmanager   1/1     16m
      aiops-topology-cassandra                           1/1     43m
      c-example-couchdbcluster-m                         1/1     40m
      cp4waiops-eventprocessor-eve-29ee-ep-jobmanager    1/1     47m
      cp4waiops-eventprocessor-eve-29ee-ep-taskmanager   2/2     45m
      iaf-system-elasticsearch-es-aiops                  1/1     49m
      ibm-cp-aiops-redis-server                          3/3     45m
      zen-minio                                          3/3     40m
      

      Note:

      • If you do not have an IBM® Netcool® Operations Insight® probe integration, then aiops-ir-core-ncobackup and aiops-ir-core-ncoprimary have zero replicas.
      • If you have a base deployment, then you will not have cp4waiops-eventprocessor-eve-29ee-ep-jobmanager and cp4waiops-eventprocessor-eve-29ee-ep-taskmanager.
      • If you upgraded from an earlier version of IBM Cloud Pak for AIOps, you will also have an icp-mongodb StatefulSet.
    2. Make a note of the number of replicas for each StrimziPodSet.

      oc get strimzipodset -n ${AIOPS_NAMESPACE}
      

      Example output:

      NAME                                  PODS   READY PODS   CURRENT PODS   AGE
      iaf-system-kafka                      3      3            3              13d
      iaf-system-zookeeper                  3      3            3              13d
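
      Optionally, you can save the counts to a file so that they are available after the restart. A minimal sketch, assuming the file name aiops-replicas.txt and that the StrimziPodSet pod count is exposed in .status.pods:

      # Record StatefulSet and StrimziPodSet replica counts for later reference
      oc get statefulsets -n "${AIOPS_NAMESPACE}" \
        -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.replicas --no-headers > aiops-replicas.txt
      oc get strimzipodset -n "${AIOPS_NAMESPACE}" \
        -o custom-columns=NAME:.metadata.name,PODS:.status.pods --no-headers >> aiops-replicas.txt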
      

4. Scale down the workloads and drain the nodes

  1. Scale down the operator deployments in the IBM Cloud Pak for AIOps namespace.

    1. Run the following commands to create two scripts named aiops-operator-scale-down.sh and aiops-operator-scale-up.sh. The scripts record the current replica count for each operator deployment and generate the commands to scale the replicas down to zero and back up again.

      oc get deploy -n "${AIOPS_NAMESPACE}" -l olm.owner.kind=ClusterServiceVersion -o go-template='{{range .items}}{{printf "oc scale deploy -n '"${AIOPS_NAMESPACE}"'  %s --replicas=0\n" .metadata.name }}{{end}}' > aiops-operator-scale-down.sh
      
      oc get deploy -n "${AIOPS_NAMESPACE}" -l olm.owner.kind=ClusterServiceVersion -o go-template='{{range .items}}{{printf "oc scale deploy -n '"${AIOPS_NAMESPACE}"'  %s --replicas=%d\n" .metadata.name .spec.replicas }}{{end}}' > aiops-operator-scale-up.sh
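
      The generated aiops-operator-scale-down.sh contains one oc scale command for each operator deployment. For example, with AIOPS_NAMESPACE set to cp4aiops, the first lines resemble the following (the exact names vary by deployment):

      oc scale deploy -n cp4aiops  aimanager-operator-controller-manager --replicas=0
      oc scale deploy -n cp4aiops  aiopsedge-operator-controller-manager --replicas=0
      oc scale deploy -n cp4aiops  asm-operator --replicas=0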
      
    2. Scale down the operators.

      Run the following commands:

      chmod +x ./aiops-operator-scale-down.sh
      ./aiops-operator-scale-down.sh
      
    3. Run the following command to check that the number of replicas for each of the operator deployments is now 0.

      oc get deployment -n ${AIOPS_NAMESPACE} -l olm.owner.kind=ClusterServiceVersion
      

      Example output:

      NAME                                              READY   UP-TO-DATE   AVAILABLE   AGE
      aimanager-operator-controller-manager             0/0     0            0           47m
      aiopsedge-operator-controller-manager             0/0     0            0           47m
      asm-operator                                      0/0     0            0           47m
      iaf-flink-operator-controller-manager             0/0     0            0           54m
      ibm-aiops-orchestrator-controller-manager         0/0     0            0           58m
      ibm-common-service-operator                       0/0     0            0           56m
      ibm-commonui-operator                             0/0     0            0           53m
      ibm-elastic-operator-controller-manager           0/0     0            0           54m
      ibm-events-operator-v5.0.1                        0/0     0            0           54m
      ibm-iam-operator                                  0/0     0            0           54m
      ibm-ir-ai-operator-controller-manager             0/0     0            0           47m
      ibm-redis-cp-operator                             0/0     0            0           49m
      ibm-secure-tunnel-operator                        0/0     0            0           48m
      ibm-watson-aiops-ui-operator-controller-manager   0/0     0            0           48m
      ibm-zen-operator                                  0/0     0            0           54m
      ir-core-operator-controller-manager               0/0     0            0           47m
      ir-lifecycle-operator-controller-manager          0/0     0            0           47m
      operand-deployment-lifecycle-manager              0/0     0            0           55m
      postgresql-operator-controller-manager-1-18-12    0/0     0            0           54m
      

      Note: If you upgraded from an earlier version of IBM Cloud Pak for AIOps, you will also have an icp-mongodb-operator deployment.

  2. Scale down the StatefulSets that you noted in step 3.3.

    You can use the Red Hat OpenShift console, or create a shell script to do this.

    If you have a base deployment, then remove the following lines from the example shell script:

    oc scale statefulsets cp4waiops-eventprocessor-eve-29ee-ep-jobmanager --replicas=0 -n ${AIOPS_NAMESPACE}
    oc scale statefulsets cp4waiops-eventprocessor-eve-29ee-ep-taskmanager --replicas=0 -n ${AIOPS_NAMESPACE}
    

    If you upgraded from an earlier version of IBM Cloud Pak for AIOps, then add the following line to the example shell script:

    oc scale statefulsets icp-mongodb --replicas=0 -n ${AIOPS_NAMESPACE}
    

    Example shell script:

    #!/bin/bash
    
    oc scale statefulsets aimanager-ibm-minio --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets ${AIOPS_INSTANCE}-redis-server --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets aiops-ir-analytics-spark-worker --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets aiops-ir-core-ncobackup --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets aiops-ir-core-ncoprimary --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets aiops-ir-lifecycle-eventprocessor-ep-jobmanager --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets aiops-ir-lifecycle-eventprocessor-ep-taskmanager --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets aiops-topology-cassandra --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets c-example-couchdbcluster-m --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets cp4waiops-eventprocessor-eve-29ee-ep-jobmanager --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets cp4waiops-eventprocessor-eve-29ee-ep-taskmanager --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets iaf-system-elasticsearch-es-aiops --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    oc scale statefulsets zen-minio --replicas=0 -n ${AIOPS_NAMESPACE}
    sleep 2
    

    Run the following command to check that the number of replicas for each of the StatefulSets is now 0.

    oc get statefulsets -n ${AIOPS_NAMESPACE}
    

    Example output:

    NAME                                               READY   AGE
    aimanager-ibm-minio                                0/0     112m
    aiops-ir-analytics-spark-worker                    0/0     128m
    aiops-ir-core-ncobackup                            0/0     131m
    aiops-ir-core-ncoprimary                           0/0     133m
    aiops-ir-lifecycle-eventprocessor-ep-jobmanager    0/0     112m
    aiops-ir-lifecycle-eventprocessor-ep-taskmanager   0/0     111m
    aiops-topology-cassandra                           0/0     138m
    c-example-couchdbcluster-m                         0/0     134m
    cp4waiops-eventprocessor-eve-29ee-ep-jobmanager    0/0     142m
    cp4waiops-eventprocessor-eve-29ee-ep-taskmanager   0/0     139m
    iaf-system-elasticsearch-es-aiops                  0/0     143m
    ibm-cp-aiops-redis-server                          0/0     140m
    zen-minio                                          0/0     134m
    

    Note:

    • If you have a base deployment, then you will not have cp4waiops-eventprocessor-eve-29ee-ep-jobmanager and cp4waiops-eventprocessor-eve-29ee-ep-taskmanager.
    • If you upgraded from an earlier version of IBM Cloud Pak for AIOps, you will also have an icp-mongodb StatefulSet.
  3. Shut down the Kafka and ZooKeeper pods.

    oc delete pod -l ibmevents.ibm.com/name=iaf-system-kafka -n ${AIOPS_NAMESPACE}
    oc delete pod -l ibmevents.ibm.com/name=iaf-system-zookeeper -n ${AIOPS_NAMESPACE}
    

    Run the following command to check that the Kafka and ZooKeeper pods have shut down successfully. When the shutdown is complete, no pods are returned.

    oc get pod -l ibmevents.ibm.com/controller=strimzipodset -n ${AIOPS_NAMESPACE}
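
    If you want to block until the shutdown completes, you can poll instead of checking manually. A simple sketch:

    # Poll until no Kafka or ZooKeeper pods remain
    while [ -n "$(oc get pod -l ibmevents.ibm.com/controller=strimzipodset -n ${AIOPS_NAMESPACE} --no-headers 2>/dev/null)" ]
    do
      echo "Waiting for Kafka and ZooKeeper pods to terminate..."
      sleep 10
    done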
    
  4. Scale down the PostgreSQL pods.

    When you shut down a PostgreSQL cluster, it is best to delete the primary replica last. The following script deletes each replica pod in each Postgres cluster, deleting the primary last.

    Before running the script, replace <project> with the namespace (project) that your IBM Cloud Pak for AIOps installation is deployed in.

    #!/bin/bash
    
    AIOPS_NAMESPACE=<project>
    
    # Get array of Postgres clusters
    CLUSTERS=($(oc get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" -o go-template='{{range .items}}{{.metadata.name}}{{" "}}{{end}}'))
    
    # For each Postgres cluster, shutdown primary last
    for cluster_name in "${CLUSTERS[@]}"; do
        primary=$(oc get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" "${cluster_name}" -o go-template='{{.status.currentPrimary}}')
        instances=($(oc get clusters.postgresql.k8s.enterprisedb.io -n "${AIOPS_NAMESPACE}" "${cluster_name}" -o go-template='{{range .status.instanceNames}}{{print . " "}}{{end}}'))
        for instance_name in "${instances[@]}"; do
            # Shutdown non-primary replicas
            if [ "${instance_name}" != "${primary}" ]; then
                oc delete pod -n "${AIOPS_NAMESPACE}" "${instance_name}" --ignore-not-found
            fi
        done
    
        # Shutdown the primary once all other replicas are down
        oc delete pod -n "${AIOPS_NAMESPACE}" "${primary}" --ignore-not-found
    done
    

    Wait for all the Postgres pods to be deleted. All pods are deleted when the following command returns no pods:

    oc get pod -l k8s.enterprisedb.io/podRole=instance -n ${AIOPS_NAMESPACE}
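
    Alternatively, you can use oc wait to block until the instance pods are deleted. A minimal sketch with an assumed five-minute timeout:

    oc wait --for=delete pod -l k8s.enterprisedb.io/podRole=instance \
      -n "${AIOPS_NAMESPACE}" --timeout=300s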
    
  5. (Optional) After the StatefulSets and StrimziPodSets are scaled down, drain the nodes by running the drain command manually, or by running a loop as follows:

    for i in $(seq 0 $((WORKER_COUNT - 1)))
    do
      echo "$i"
      oc adm drain "worker${i}.${CLUSTER_NAME}" --ignore-daemonsets --force --grace-period=30 --delete-emptydir-data
      sleep 10
    done
    

    Note: Some pods, such as storage pods, do not stop, because stopping them would violate the pod disruption budget. When this happens, the drain command retries continuously and never proceeds to the next worker node. To work around this, let the drain run until only the storage pods are left on the node, then stop the command and restart the loop at the next worker node.
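
    To see which pods are still running on a node while it drains, you can filter pods by node name. A sketch, using worker0 as an example:

    # List the pods that remain on the node (excluding completed ones)
    oc get pods -A --field-selector spec.nodeName="worker0.${CLUSTER_NAME}" --no-headers | grep -v Completed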

5. Shut down the cluster

  1. Shut down all the worker nodes on the cluster.

  2. Shut down all the master nodes on the cluster.

  3. Shut down the API node on the cluster.

For more information about shutting down your cluster nodes, see step 4 of Shutting down a cluster gracefully in the Red Hat OpenShift documentation.

6. Restart the cluster

  1. Re-export the environment variables that you saved in step 3.1 (Prepare to scale down).

  2. Restart the cluster nodes in the following order:

    1. Restart the API node.

    2. Restart the master nodes. Check whether all master nodes are in ready status by running the following command:

      oc get nodes
      
    3. Restart the worker nodes. Check whether all worker nodes are in ready status by running the following command:

      oc get nodes
      
  3. After all the nodes are up, uncordon the nodes by running the following command:

    for i in $(seq 0 $((WORKER_COUNT - 1)))
    do
      oc adm uncordon "worker${i}.${CLUSTER_NAME}"
      echo "$i"
      sleep 2
    done
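
    You can confirm that no node remains cordoned by checking for the SchedulingDisabled status:

    oc get nodes | grep SchedulingDisabled || echo "All nodes are schedulable"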
    

7. Scale up the workloads

Scaling up the workloads in the following order helps to minimize startup time and resource contention issues.

  1. Scale the events operator back up.

    oc scale deployment --replicas=1 $(oc get deployment -o custom-columns=NAME:.metadata.name --no-headers -n ${AIOPS_NAMESPACE}  | grep '^ibm-events-operator-') -n ${AIOPS_NAMESPACE}
    
  2. Check whether the Kafka and ZooKeeper pods are running again. This can take a few minutes.

    oc get pod -l ibmevents.ibm.com/controller=strimzipodset -n ${AIOPS_NAMESPACE}
    

    Example output when the Kafka and ZooKeeper pods are running:

    NAME                                    READY   STATUS    RESTARTS   AGE
    iaf-system-kafka-0                      1/1     Running   0          13d
    iaf-system-kafka-1                      1/1     Running   0          13d
    iaf-system-kafka-2                      1/1     Running   0          13d
    iaf-system-zookeeper-0                  1/1     Running   0          13d
    iaf-system-zookeeper-1                  1/1     Running   0          13d
    iaf-system-zookeeper-2                  1/1     Running   0          13d
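
    If you prefer to wait programmatically, the following sketch blocks until the pods report Ready, with an assumed ten-minute timeout:

    oc wait --for=condition=Ready pod -l ibmevents.ibm.com/controller=strimzipodset \
      -n "${AIOPS_NAMESPACE}" --timeout=600s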
    
  3. Scale up each of the StatefulSets in the following order, to the number of replicas that you noted in step 3.3.

    • aiops-topology-cassandra
    • iaf-system-elasticsearch-es-aiops
    • aiops-ir-analytics-spark-worker
    • aiops-ir-lifecycle-eventprocessor-ep-jobmanager
    • aiops-ir-lifecycle-eventprocessor-ep-taskmanager
    • cp4waiops-eventprocessor-eve-29ee-ep-jobmanager
    • cp4waiops-eventprocessor-eve-29ee-ep-taskmanager
    • aiops-ir-core-ncoprimary
    • aiops-ir-core-ncobackup
    • c-example-couchdbcluster-m
    • ${AIOPS_INSTANCE}-redis-server
    • aimanager-ibm-minio
    • zen-minio

    Note:

    • If you have a base deployment, then do not scale up cp4waiops-eventprocessor-eve-29ee-ep-jobmanager and cp4waiops-eventprocessor-eve-29ee-ep-taskmanager.
    • If you upgraded to IBM Cloud Pak for AIOps 4.6.0 from an earlier version, you might have an icp-mongodb StatefulSet. Do not scale it up.

    Run the following command to rescale each StatefulSet:

    oc scale statefulsets <statefulset> --replicas=<number_of_replicas> -n ${AIOPS_NAMESPACE}
    

    Where:

    • <statefulset> is the StatefulSet to be scaled up
    • <number_of_replicas> is the number of replicas that the StatefulSet is to be scaled up to

    For example:

    oc scale statefulsets aiops-topology-cassandra --replicas=1 -n cp4aiops
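
    To script the ordered scale-up, you can loop over name and replica pairs and wait for each rollout before starting the next. A sketch with illustrative replica counts; substitute the values that you noted in step 3.3:

    # Scale each StatefulSet in order and wait for its rollout to finish
    for entry in "aiops-topology-cassandra 1" "iaf-system-elasticsearch-es-aiops 1" "aiops-ir-analytics-spark-worker 2"
    do
      set -- ${entry}
      oc scale statefulsets "$1" --replicas="$2" -n "${AIOPS_NAMESPACE}"
      oc rollout status statefulset "$1" -n "${AIOPS_NAMESPACE}" --timeout=600s
    done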
    
  4. Scale up the operator deployments.

    Run the aiops-operator-scale-up.sh script that you created earlier in step 4 (Scale down the workloads and drain the nodes):

    chmod +x ./aiops-operator-scale-up.sh
    ./aiops-operator-scale-up.sh
    

8. Validate the installation

Note: After a complete cluster restart, it might take approximately an hour for the installation to start running again.

  1. Export the namespace where Cloud Pak for AIOps is installed:

    export AIOPS_NAMESPACE=<project>
    
  2. Run the describe command:

    oc describe installations.orchestrator.aiops.ibm.com -n "${AIOPS_NAMESPACE}"
    

    Review the ComponentStatus fields to confirm that all components are marked as Ready and the phase is Running.
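
    Because the restart can take a while, you can poll the phase instead of re-running the describe command manually. A minimal sketch that checks every five minutes, assuming a single Installation resource:

    while [ "$(oc get installations.orchestrator.aiops.ibm.com -n "${AIOPS_NAMESPACE}" -o jsonpath='{.items[0].status.phase}')" != "Running" ]
    do
      echo "Installation is not yet Running; checking again in 5 minutes..."
      sleep 300
    done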