Offline upgrade of IBM Cloud Pak for AIOps
Use these instructions to upgrade IBM Cloud Pak® for AIOps 4.6.0 or later to 4.7.0.
This procedure can be used on an offline deployment of IBM Cloud Pak for AIOps 4.6.0 or later, and can still be used if the deployment has had hotfixes applied. If you have an online deployment, follow the instructions in Upgrading IBM Cloud Pak for AIOps (online).
Before you begin
- Ensure that you are logged in to your Red Hat® OpenShift® Container Platform cluster with oc login for any steps that use the OpenShift command-line interface (CLI).
- Red Hat OpenShift Container Platform requires a user with cluster-admin privileges for the operations in this procedure.
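For example, one informal way to confirm that your logged-in user has cluster-admin (or equivalent) privileges is:
oc auth can-i '*' '*' --all-namespaces
If the command prints yes, the user can perform any action in any namespace.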
Warnings:
- Custom patches, labels, and manual adjustments to IBM Cloud Pak for AIOps resources are lost when IBM Cloud Pak for AIOps is upgraded, and must be manually reapplied after upgrade. For more information, see Manual adjustments are not persisted.
- If you previously increased the size of the Kafka PVC directly, then you must follow the correct procedure that is supplied in Resizing storage to ensure that the size is updated by the operator. Failure to do so before upgrading IBM Cloud Pak for AIOps causes the operator to attempt to restore a lower default value for the Kafka PVC, and causes an error in your IBM Cloud Pak for AIOps deployment.
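As a quick check before you upgrade, you can list the current Kafka PVC sizes; the grep pattern here is illustrative and your PVC names might differ:
oc get pvc -n <project> | grep -i kafka
Where <project> is the namespace that IBM Cloud Pak for AIOps is deployed in.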
Restrictions:
- You cannot use these instructions to upgrade deployments of IBM Cloud Pak for AIOps 4.5.1 or earlier. For more information, see Upgrade paths.
- The upgrade cannot be removed or rolled back.
- If you are planning to upgrade Red Hat OpenShift as part of an upgrade to IBM Cloud Pak for AIOps 4.7.0, then you must complete the IBM Cloud Pak for AIOps upgrade before you upgrade Red Hat OpenShift.
Upgrade procedure
Follow these steps to upgrade your offline IBM Cloud Pak for AIOps deployment.
1. Ensure cluster readiness
Recommended: Take a backup before upgrading. For more information, see Backup and restore.
- Ensure that your cluster still meets all of the prerequisites for an air-gapped deployment.
Note: IBM Cloud Pak for AIOps 4.7.0 requires 6 vCPU more than IBM Cloud Pak for AIOps 4.6.0 and 4.6.1.
Review the steps for your installation approach:
- Bastion host: Prerequisites
- Portable device: Prerequisites
- Ensure that the IBM Catalog Management Plug-in for IBM Cloud Pak® (ibm-pak-plugin) is at the required level.
Important: ibm-pak-plugin v1.9 and lower cannot process the IBM Cloud Pak for AIOps 4.7.0 catalog. You must ensure that you have ibm-pak-plugin v1.10 or higher installed.
- Check which version of ibm-pak-plugin you have installed. Run the following command on your bastion host, portable compute device, or connected compute device if you are using a portable storage device.
oc ibm-pak --version
Example output:
v1.13.0
- If the ibm-pak-plugin version is lower than v1.10.0, then you must download and install the most recent version. Follow the steps for your installation approach:
- Bastion host: Install the IBM Catalog Management Plug-in for IBM Cloud Pak®.
- Portable device: Install the IBM Catalog Management Plug-in for IBM Cloud Pak®.
- Download scripts.
- Download the prerequisite checker script and copy it to your air-gapped environment.
For more information about the script, including how to download and run it, see github.com/IBM.
- Download the IBM Cloud Pak for AIOps uninstall script, and copy it to your air-gapped environment.
For more information about the script, including how to download and run it, see github.com/IBM.
- (Optional) Download the status checker script, and copy it to your air-gapped environment.
For more information about the script, including how to download and run it, see github.com/IBM. The status checker script can be used in section 8, Verify the deployment, to give information about the status of your deployment. The use of this script is optional, as status can be found directly from the ibm-aiops-orchestrator custom resource.
- If you still have waiops_var.sh from when you installed IBM Cloud Pak for AIOps, then run the following command from the directory that the script is in, to set the environment variables that are used later.
. ./waiops_var.sh
If you do not have waiops_var.sh, then run the following commands to set the environment variables that you need for upgrade.
export PROJECT_CP4AIOPS=<project>
export INSTALL_MODE_NAMESPACE=<install_namespace>
Where <project> is the namespace (project) that your IBM Cloud Pak for AIOps subscription is deployed in, and <install_namespace> is ${PROJECT_CP4AIOPS} if your deployment is namespace scoped, or openshift-operators if your deployment has a cluster-wide scope.
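If you are not sure which scope your deployment uses, one informal way to check is to look for the orchestrator subscription in the openshift-operators project:
oc get subscription.operators.coreos.com ibm-aiops-orchestrator -n openshift-operators
If this command returns the subscription, your deployment is cluster-scoped; if it returns nothing, your deployment is namespace scoped.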
- Run the IBM Cloud Pak for AIOps prerequisite checker script.
Run the prerequisite checker script to ensure that your Red Hat OpenShift Container Platform cluster is correctly set up for an IBM Cloud Pak for AIOps upgrade.
Download the prerequisite checker script from github.com/IBM, and run it with the following command:
./prereq.sh -n ${PROJECT_CP4AIOPS} --ignore-allocated
Important: The prerequisite checker script might show inadequate resources in the Resource Summary because the script does not account for resources already being in use by the upgrading deployment. This can be ignored, as can the following message: [ FAIL ] Small or Large Profile Install Resources.
Example output:
# ./prereq.sh -n cp4aiops --ignore-allocated
[INFO] Starting IBM Cloud Pak for AIOps prerequisite checker v4.7...
CLI: oc
[INFO] =================================Platform Version Check=================================
[INFO] Checking Platform Type....
[INFO] You are using Openshift Container Platform
[INFO] OCP version 4.16.7 is compatible but only nodes with AMD64 architectures are supported at this time.
[INFO] =================================Platform Version Check=================================
[INFO] =================================Storage Provider=================================
[INFO] Checking storage providers
[INFO] No IBM Storage Fusion Found... Skipping configuration check.
[INFO] No Portworx StorageClusters found with "Running" or "Online" status. Skipping configuration check for Portworx.
[INFO] Openshift Data Foundation found.
[INFO] No IBM Cloud Storage found... Skipping configuration check for IBM Cloud Storage Check.
Checking Openshift Data Foundation Configuration...
Verifying if Red Hat Openshift Data Foundation pods are in "Running" or "Completed" status
[INFO] Pods in openshift-storage project are "Running" or "Completed"
[INFO] ocs-storagecluster-ceph-rbd exists.
[INFO] ocs-storagecluster-cephfs exists.
[INFO] No warnings or failures found when checking for Storage Providers.
[INFO] =================================Storage Provider=================================
[INFO] =================================Cert Manager Check=================================
[INFO] Checking for Cert Manager operator
[INFO] Successfully functioning cert-manager found.
CLUSTERSERVICEVERSION              NAMESPACE
ibm-cert-manager-operator.v4.2.8   ibm-cert-manager
[INFO] =================================Cert Manager Check=================================
[INFO] =================================Licensing Service Operator Check=================================
[INFO] Checking for Licensing Service operator
[INFO] Successfully functioning licensing service operator found.
CLUSTERSERVICEVERSION           NAMESPACE
ibm-licensing-operator.v4.2.8   ibm-licensing
[INFO] =================================Licensing Service Operator Check=================================
[INFO] =================================Starter or Production Install Resources=================================
[INFO] Checking for cluster resources
[INFO] ==================================Resource Summary=====================================================
[INFO]                                                         Nodes       | vCPU          | Memory(GB)
[INFO] Starter (Non-HA) Base (available/required)              [ 9 / 3 ]   [ 144 / 47 ]    [ 289 / 123 ]
[INFO] (+ Log Anomaly Detection & Ticket Analysis)             [ 9 / 3 ]   [ 144 / 55 ]    [ 289 / 136 ]
[INFO] Production (HA) Base (available/required)               [ 9 / 6 ]   [ 144 / 136 ]   [ 289 / 310 ]
[INFO] (+ Log Anomaly Detection & Ticket Analysis)             [ 9 / 6 ]   [ 144 / 162 ]   [ 289 / 368 ]
[INFO] ==================================Resource Summary=====================================================
[INFO] Cluster currently has resources available to create a Starter (Non-HA) install of Cloud Pak for AIOps
[INFO] =================================Prerequisite Checker Tool Summary=================================
[ PASS ] Platform Version Check
[ PASS ] Storage Provider
[ PASS ] Starter (Non-HA) Base Install Resources
[ FAIL ] Production (HA) Base Install Resources
[ PASS ] Cert Manager Operator Installed
[ PASS ] Licensing Service Operator Installed
[INFO] =================================Prerequisite Checker Tool Summary=================================
- Delete any evicted connector-orchestrator pods.
- Run the following command to check if there are any evicted connector-orchestrator pods.
oc get pods -n ${PROJECT_CP4AIOPS} | grep connector-orchestrator
- Clean up any evicted connector-orchestrator pods.
If the previous command returned any pods with a STATUS of Evicted, then run the following command to delete each of them.
oc delete pod -n ${PROJECT_CP4AIOPS} <connector_orchestrator>
Where <connector_orchestrator> is a pod returned in the previous step.
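If several pods were evicted, you can optionally delete them in one pass with a sketch like the following. It assumes that the STATUS column shows Evicted, and xargs -r (GNU) skips the delete when no pods match:
oc get pods -n ${PROJECT_CP4AIOPS} --no-headers | awk '/connector-orchestrator/ && /Evicted/ {print $1}' | xargs -r oc delete pod -n ${PROJECT_CP4AIOPS}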
2. Download CASE files
On your OpenShift cluster, rerun step 2 of the air-gap installation procedure, Download CASE files, to download the latest CASE files.
Follow the steps for your installation approach:
- Bastion host: Download the CASE
- Portable device: Download the CASE
3. Mirror images
Rerun step 3 of the air-gap installation procedure to mirror the updated images to the offline registry.
Follow the steps for your installation approach:
- Bastion host: Mirror images
- Portable device: Mirror images
4. Update the catalog
Rerun step 5.1 of the air-gap installation procedure, Create the catalog source, to update your catalog source.
Follow the steps for your installation approach:
- Bastion host: Create the catalog source
- Portable device: Create the catalog source
5. Maintain custom IR Flink configuration
If IBM Sales representatives and Business Partners used the custom sizing tool to supply you with a custom profile ConfigMap that customizes the IR FlinkCluster, then use these steps to update it. Otherwise, skip the rest of this step and proceed to section 6, Maintain custom Flink task manager configuration.
Before v4.7.0, custom profiles modified the IR FlinkCluster resources through the overrides and statefulsets fields. From v4.7.0, a new FlinkDeployment resource is used, and you must add new configuration to your custom profile in addition to the old configuration before you upgrade.
- Run the following command to copy the contents of your custom profile ConfigMap into a temporary file called profiles.yaml.
oc get configmap $(oc get installation -n ${INSTALL_MODE_NAMESPACE} -o jsonpath='{.items[0].status.customProfileConfigmap}') -o jsonpath='{.data.profiles}' > profiles.yaml
- Add the following entry under spec.lifecycleservice.customSizing.deployments in profiles.yaml.
- name: flink-taskmanager
  replicas: 6
  containers:
    - name: flink-main-container
      limits:
        cpu: "3000m"
        ephemeral-storage: "3Gi"
        memory: "8192Mi"
      requests:
        cpu: "1154m"
        ephemeral-storage: "3Gi"
        memory: "8192Mi"
Important: The values in this codeblock are an example only. Set these values to match the values for ir-lifecycle-operator in spec.lifecycleservice.customSizing.statefulSets.
- Add the following entry under spec.lifecycleservice.overrides in profiles.yaml.
flinkdeployment:
  properties:
    taskmanager.memory.task.heap.size: 1638M
Important: The value 1638M in this codeblock is an example only. Set this value to match the value for ir-lifecycle-operator in spec.lifecycleservice.overrides.eventprocessor.flink.properties.
- Verify that profiles.yaml looks similar to the following example custom profile.
...
operandconfigs:
  ...
  - name: ir-lifecycle-operator
    spec:
      lifecycleservice:
        overrides:
          ...
          eventprocessor:
            flink:
              properties:
                taskmanager.memory.task.heap.size: 1638M
                taskmanager.memory.process.size: 6554M
          flinkdeployment:
            properties:
              taskmanager.memory.task.heap.size: 1638M
        customSizing:
          deployments:
            - name: flink-taskmanager
              replicas: 6
              containers:
                - name: flink-main-container
                  limits:
                    cpu: "3000m"
                    ephemeral-storage: "3Gi"
                    memory: "8192Mi"
                  requests:
                    cpu: "1154m"
                    ephemeral-storage: "3Gi"
                    memory: "8192Mi"
          ...
          statefulSets:
            - name: eventprocessor-ep-taskmanager
              replicas: 6
              containers:
                - name: taskmanager
                  requests:
                    cpu: 1154m
                    memory: 8192Mi
                  limits:
                    cpu: 3000m
                    memory: 8192Mi
- Update your custom profile ConfigMap with the contents of profiles.yaml.
oc set data configmap $(oc get installation -n ${INSTALL_MODE_NAMESPACE} -o jsonpath='{.items[0].status.customProfileConfigmap}') --from-file=profiles=profiles.yaml
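To confirm that the update was applied, you can optionally print the ConfigMap data back out and compare it with your temporary file; no output from diff means that the two match:
oc get configmap $(oc get installation -n ${INSTALL_MODE_NAMESPACE} -o jsonpath='{.items[0].status.customProfileConfigmap}') -o jsonpath='{.data.profiles}' | diff - profiles.yaml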
6. Maintain custom Flink task manager configuration
Use the following steps to determine if your deployment has a custom configuration for the Flink task manager replica count. This would have been configured after installation using the steps in Increasing data streaming capacity.
- Run the following command to determine if your deployment has a custom configuration for the Flink task manager replica count.
oc get subscriptions.operators.coreos.com ibm-aiops-orchestrator -n ${INSTALL_MODE_NAMESPACE} -o jsonpath='{.spec.config.env[?(@.name=="FLINK_TASK_MGR_REPLICAS")].value}'
If the command returns nothing, then skip the rest of this step and proceed to section 7, Update the operator subscription.
- Run the following steps to maintain your custom Flink task manager replica count during upgrade.
If IBM Sales representatives and Business Partners did not use the custom sizing tool to supply you with a custom profile ConfigMap, then run the following command to apply one:
export FLINK_TASK_MGR_REPLICAS=$(oc get subscriptions.operators.coreos.com ibm-aiops-orchestrator -n ${INSTALL_MODE_NAMESPACE} -o jsonpath='{.spec.config.env[?(@.name=="FLINK_TASK_MGR_REPLICAS")].value}')
cat << EOF | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/instance: aiops
    app.kubernetes.io/managed-by: ibm-aiops-orchestrator
    app.kubernetes.io/name: custom-sized-profiles
    app.kubernetes.io/part-of: ibm-aiops
    aiops-custom-size-profile-version: 4.7.0
  name: aiops-custom-size-profile
  namespace: ${INSTALL_MODE_NAMESPACE}
data:
  # WARNING: Modifications to this ConfigMap may cause your AIOps installation to become unstable.
  profiles: |
    generatedfor: HA
    cp4waiops-eventprocessor:
      flink:
        taskmanager:
          replicas: ${FLINK_TASK_MGR_REPLICAS}
EOF
If IBM Sales representatives and Business Partners used the custom sizing tool to supply you with a custom profile ConfigMap, then run the following commands to update it:
- Determine the required number of Flink task manager replicas.
oc get subscriptions.operators.coreos.com ibm-aiops-orchestrator -n ${INSTALL_MODE_NAMESPACE} -o jsonpath='{.spec.config.env[?(@.name=="FLINK_TASK_MGR_REPLICAS")].value}'
- Run the following command to copy the contents of your custom profile ConfigMap into a temporary file called profiles2.yaml.
oc get configmap $(oc get installation -n ${INSTALL_MODE_NAMESPACE} -o jsonpath='{.items[0].status.customProfileConfigmap}') -o jsonpath='{.data.profiles}' > profiles2.yaml
- Edit profiles2.yaml and adjust the value of flink.taskmanager.replicas to match the value obtained in step 2a.
Example excerpt:
cp4waiops-eventprocessor:
  flink:
    taskmanager:
      replicas: <Flink task manager replica count>
Where <Flink task manager replica count> is the value obtained in step 2a.
- Update your custom profile ConfigMap with the contents of profiles2.yaml.
oc set data configmap $(oc get installation -n ${INSTALL_MODE_NAMESPACE} -o jsonpath='{.items[0].status.customProfileConfigmap}') --from-file=profiles=profiles2.yaml
7. Update the operator subscription
- Run the following command to update the IBM Cloud Pak for AIOps subscription to use the new channel, v4.7.
oc patch subscription.operators.coreos.com ibm-aiops-orchestrator -n ${INSTALL_MODE_NAMESPACE} --type=json -p='[{"op": "replace", "path": "/spec/channel", "value": "v4.7"}]'
- If you are installing in AllNamespaces mode, run the following command to refresh the connectors' secret:
oc delete secret cp4waiops-connectors-deploy-cert-secret -n "${PROJECT_CP4AIOPS}"
For more information about installation modes, see Operator installation mode.
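After you patch the channel, you can optionally confirm that Operator Lifecycle Manager is rolling out the new operator; rerun a command like the following until the 4.7.0 CSV reports a Succeeded phase (an informal check, matching the example output in section 8.2):
oc get csv -n ${INSTALL_MODE_NAMESPACE} | grep ibm-aiops-orchestrator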
8. Verify the deployment
8.1 Check the deployment
Run the following command to check that the PHASE of your deployment is Updating.
oc get installations.orchestrator.aiops.ibm.com -n ${PROJECT_CP4AIOPS}
Example output:
NAME PHASE LICENSE STORAGECLASS STORAGECLASSLARGEBLOCK AGE
ibm-cp-aiops Updating Accepted rook-cephfs rook-ceph-block 3m
It takes around 60-90 minutes for the upgrade to complete (subject to the speed with which images can be pulled). When installation is complete and successful, the PHASE of your installation changes to Running.
If your installation phase does not change to Running, then use the following command to find out which components are not ready:
oc get installation.orchestrator.aiops.ibm.com -o yaml -n ${PROJECT_CP4AIOPS} | grep 'Not Ready'
Example output:
lifecycleservice: Not Ready
zenservice: Not Ready
To see details about why a component is Not Ready, run the following command, where <component> is the component that is not ready, for example zenservice.
oc get <component> -o yaml -n ${PROJECT_CP4AIOPS}
(Optional) If you downloaded the status checker script earlier in section 1, Ensure cluster readiness, then you can also run this script to see information about the status of your deployment.
If the installation fails, or is not complete and is not progressing, then see Troubleshooting installation and upgrade and Known Issues to help you identify any installation problems.
Important: Wait for the deployment to enter a Running phase before continuing to the next step.
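Rather than rerunning the command manually, you can poll with a small loop like the following sketch. It assumes that the PHASE column is backed by the status.phase field of the installation resource:
while true; do
  phase=$(oc get installations.orchestrator.aiops.ibm.com -n ${PROJECT_CP4AIOPS} -o jsonpath='{.items[0].status.phase}')
  echo "$(date): PHASE is ${phase:-unknown}"
  if [ "${phase}" = "Running" ]; then break; fi
  sleep 60
done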
8.2 Check the version
Run the following command and check that the VERSION that is returned is 4.7.0.
oc get csv -l operators.coreos.com/ibm-aiops-orchestrator.${INSTALL_MODE_NAMESPACE} -n ${INSTALL_MODE_NAMESPACE}
Example output:
oc get csv -l operators.coreos.com/ibm-aiops-orchestrator.cp4aiops -n cp4aiops
NAME DISPLAY VERSION REPLACES PHASE
ibm-aiops-orchestrator.v4.7.0 IBM Cloud Pak for AIOps 4.7.0 ibm-aiops-orchestrator.v4.6.1 Succeeded
9. Post-upgrade actions
- If you previously set up backup or restore on your deployment, then you must follow the instructions in Upgrading IBM Cloud Pak for AIOps backup and restore artifacts.
- If the EXPIRY_SECONDS environment variable was set for configuring log anomaly alerts, the environment variable was not retained in the upgrade. After the upgrade is completed, set the environment variable again. For more information about setting the variable, see Configuring expiry time for log anomaly alerts.
- If you have a metric integration configured that stops working after upgrade, then you must follow the instructions in After upgrade, a metric integration goes into a failed state.
- If upgrade does not complete because the lifecycletrigger component is stuck, then follow the instructions in Upgrade does not complete because the lifecycletrigger component is stuck.
- (Optional) You can use the following steps to remove unnecessary data from your Cloud Pak for AIOps environment:
Note: Use the following steps only if high availability (HA) is enabled for your Cloud Pak for AIOps deployment.
- Switch to the project (namespace) where Cloud Pak for AIOps is deployed.
oc project <namespace>
- Verify the health of your Cloud Pak for AIOps deployment:
oc get installation -o go-template='{{$i:=index .items 0}}{{range $c,$s := $i.status.componentstatus}}{{$c}}{{": "}}{{$s}}{{"\n"}}{{end}}'
All of the components must be in Ready status.
Delete the zookeeper data by running the following four commands:
oc exec iaf-system-zookeeper-0 – /opt/kafka/bin/zookeeper-shell.sh 127.0.0.1:12181 deleteall /flink/aiops/ir-lifecycle oc exec iaf-system-zookeeper-0 – /opt/kafka/bin/zookeeper-shell.sh 127.0.0.1:12181 deleteall /flink/aiops/ir-lifecycle2 oc exec iaf-system-zookeeper-0 – /opt/kafka/bin/zookeeper-shell.sh 127.0.0.1:12181 deleteall /flink/aiops/ir-lifecycle3 oc exec iaf-system-zookeeper-0 – /opt/kafka/bin/zookeeper-shell.sh 127.0.0.1:12181 deleteall /flink/aiops/cp4waiops-eventprocessor
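You can optionally confirm that the znodes were removed by listing the parent path with the same zookeeper-shell tool:
oc exec iaf-system-zookeeper-0 -- /opt/kafka/bin/zookeeper-shell.sh 127.0.0.1:12181 ls /flink/aiops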
- Delete the Issue Resolution (IR) lifecycle metadata by running the following commands:
img=$(oc get csv -o jsonpath='{.items[?(@.spec.displayName=="IBM AIOps AI Manager")].spec.install.spec.deployments[?(@.name=="aimanager-operator-controller-manager")].spec.template.metadata.annotations.olm\.relatedImage\.opencontent-minio-client}')
minio=$(oc get flinkdeployment aiops-ir-lifecycle-flink -o jsonpath='{.spec.flinkConfiguration.s3\.endpoint}')
oc delete job --ignore-not-found aiops-clean-s3
cat <<EOF | oc apply --validate -f -
apiVersion: batch/v1
kind: Job
metadata:
  name: aiops-clean-s3
spec:
  backoffLimit: 6
  parallelism: 1
  template:
    metadata:
      labels:
        component: aiops-clean-s3
      name: clean-s3
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values:
                      - amd64
      containers:
        - command:
            - /bin/bash
            - -c
            - |-
              echo "Connecting to Minio server: $minio"
              try=0
              while true; do
                mc alias set aiopss3 $minio \$(cat /config/accesskey) \$(cat /config/secretkey)
                if [ \$? -eq 0 ]; then break; fi
                try=\$(expr \$try + 1)
                if [ \$try -ge 30 ]; then exit 1; fi
                sleep 2
              done
              /workdir/bin/mc rm -r --force aiopss3/aiops-ir-lifecycle/high-availability/ir-lifecycle
              x=\$?
              /workdir/bin/mc ls aiopss3/aiops-ir-lifecycle/high-availability
              exit \$x
          image: $img
          imagePullPolicy: IfNotPresent
          name: clean-s3
          resources:
            limits:
              cpu: 500m
              memory: 512Mi
            requests:
              cpu: 200m
              memory: 256Mi
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            privileged: false
            readOnlyRootFilesystem: false
            runAsNonRoot: true
          volumeMounts:
            - name: s3-credentials
              mountPath: /config
            - name: s3-ca
              mountPath: /workdir/home/.mc/certs/CAs
      volumes:
        - name: s3-credentials
          secret:
            secretName: aimanager-ibm-minio-access-secret
        - name: s3-ca
          secret:
            items:
              - key: ca.crt
                path: ca.crt
            secretName: aimanager-certificate-secret
      restartPolicy: Never
      serviceAccount: aimanager-workload-admin
      serviceAccountName: aimanager-workload-admin
EOF
- Check the status of the job:
oc get po -l component=aiops-clean-s3
Verify that the status shows as Completed.
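Alternatively, you can block until the job finishes by using oc wait; the 10-minute timeout here is an arbitrary choice:
oc wait --for=condition=complete job/aiops-clean-s3 --timeout=10m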