
Upgrade of a multi-node InfoSphere Information Server Microservices tier to 11.7.1.4 may fail while waiting for rollout because pods are stuck in the ContainerCreating state

Troubleshooting


Problem

When upgrading to version 11.7.1.4, the IBM InfoSphere Information Server Microservices tier installer migrates all Kubernetes cluster nodes from the Docker runtime to the containerd runtime. In a multi-node environment, this involves resetting and rejoining the workers. It has been observed that the worker reset may intermittently fail to clean up the cni0 network interface, leading to further problems later in the upgrade flow.

Symptom

Multi-replica services, such as Kafka, Solr, or Zookeeper, may show pods stuck in the ContainerCreating state. When inspecting such a pod with "kubectl describe pod POD_NAME", the following or a similar error message is shown:
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "4ae1e49fc763f2062e064cb26cc71607cbc7de2dbdc1b627eb6c57176a7245b1": plugin type="flannel" failed (add): failed to delegate add: failed to set bridge addr: "cni0" already has an IP address different from 10.32.1.1/24
Furthermore, the upgrade may fail with the following or a similar error:
fatal: [deployment_coordinator]: FAILED! => {"attempts": 150, "changed": false, "cmd": ["kubectl", "rollout", "status", "statefulset", "solr", "-w=false", "-n", "default"], "delta": "0:00:00.155591", "end": "2022-09-26 10:58:04.739568", "rc": 0, "start": "2022-09-26 10:58:04.583977", "stderr": "", "stderr_lines": [], "stdout": "Waiting for 2 pods to be ready...", "stdout_lines": ["Waiting for 2 pods to be ready..."]}
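To locate the affected pods and confirm the sandbox error, commands along the following lines can be used (a minimal sketch; the pod name solr-1 is only an example, and the Microservices tier services referenced in the upgrade error run in the default namespace):
# List pods that are not in the Running state, including those stuck in ContainerCreating
kubectl get pods -n default | grep -v Running
# Show the details and recent events for one of the stuck pods
kubectl describe pod solr-1 -n default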

Cause

The worker reset procedure involves running the following command:
ip link delete cni0
For unknown reasons, this command sometimes fails to delete the interface while still reporting success. As a result, the Microservices tier installer does not catch the error early and continues with the upgrade, ultimately causing an IP address range conflict in the Flannel service and leading to the errors shown above.
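To confirm whether the interface was actually removed on a given worker, it can be queried directly; the following check, run as root on the worker, is a quick way to tell:
# Prints the interface details if cni0 still exists;
# otherwise reports: Device "cni0" does not exist.
ip link show cni0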

Environment

Information Server versions 11.7.1.0 through 11.7.1.3 with a multi-node Microservices tier that is being upgraded to 11.7.1.4.

Diagnosing The Problem

The output of the "kubectl get nodes -owide" command may show inconsistent Kubernetes versions and/or container runtimes across nodes:
$ kubectl get nodes -owide
NAME                               STATUS   ROLES           AGE      VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                   KERNEL-VERSION                CONTAINER-RUNTIME
controlplane.example.com           Ready    control-plane   3y157d   v1.24.2   10.1.1.1      <none>        Red Hat Enterprise Linux   3.10.0-1160.76.1.el7.x86_64   containerd://1.6.6
worker1.example.com                Ready    <none>          27h      v1.21.3   10.1.1.2      <none>        Red Hat Enterprise Linux   3.10.0-1160.76.1.el7.x86_64   docker://20.10.3
worker2.example.com                Ready    <none>          27h      v1.21.3   10.1.1.3      <none>        Red Hat Enterprise Linux   3.10.0-1160.66.1.el7.x86_64   docker://20.10.3
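If the wide output is hard to read, the same information can be narrowed down to the kubelet version and container runtime per node with a custom-columns query (a convenience command, not part of the installer):
kubectl get nodes -o custom-columns=NAME:.metadata.name,VERSION:.status.nodeInfo.kubeletVersion,RUNTIME:.status.nodeInfo.containerRuntimeVersion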

Resolving The Problem

The cni0 interface deletion problem can be solved by resetting and rejoining workers once more. To do this:
  1. Log in to the operating system terminal of the Microservices tier control plane as the user who installed the Microservices tier.
  2. Change your current working directory to the Microservices tier installation directory, e.g.:
    cd /opt/IBM/UGinstall/ugdockerfiles
  3. Reset the workers by using the following command:
    ./run_playbook.sh playbooks/platform/kubernetes/reset_workers.yaml -y
  4. Log in to the operating system terminal of each Microservices tier worker as root and verify that neither the cni0 nor the flannel.1 interface exists by examining the output of the ip a or ifconfig -a command. If either interface still exists, delete it manually with the ip link delete command (a scripted version of this check is shown after this list):
    ip link delete cni0
    ip link delete flannel.1
  5. Back on the control plane machine, run the following command to rejoin the workers:
    ./run_playbook.sh playbooks/platform/kubernetes/join_workers.yaml -y
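For step 4, the interface check and cleanup on each worker can be scripted as follows (a minimal sketch, run as root; it deletes each interface only if it is still present):
for IFACE in cni0 flannel.1; do
    # Delete the interface only if it still exists
    if ip link show "$IFACE" >/dev/null 2>&1; then
        ip link delete "$IFACE"
    fi
done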
After rejoining the workers, verify that all of the nodes use the containerd runtime by viewing the output of the kubectl get nodes -owide command. If any nodes still use the Docker runtime, run the following commands on the control plane node to migrate them:
./run_playbook.sh playbooks/platform/kubernetes/reset_workers.yaml -y
./run_playbook.sh playbooks/upgrade/uninstall_docker.yaml -y
./run_playbook.sh playbooks/install/install_pkg_repos.yaml -y
./run_playbook.sh playbooks/platform/kubernetes/setup_containerd.yaml --limit=workers -y
./run_playbook.sh playbooks/platform/kubernetes/join_workers.yaml -y

Finally, if the kubectl get nodes -owide command reports inconsistent Kubernetes versions across the cluster nodes, re-run the upgrade.sh script so that the cluster nodes are upgraded.
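Once the nodes are consistent, the multi-replica services mentioned in the symptom can be checked to confirm that their pods are no longer stuck (service and namespace names as reported in the upgrade error; adjust them if your environment differs):
# Confirm that the Kafka, Solr, and Zookeeper pods are Running
kubectl get pods -n default | grep -E 'kafka|solr|zookeeper'
# The rollout the installer was waiting for should now complete
kubectl rollout status statefulset solr -n default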

Document Location

Worldwide

[{"Type":"MASTER","Line of Business":{"code":"LOB10","label":"Data and AI"},"Business Unit":{"code":"BU059","label":"IBM Software w\/o TPS"},"Product":{"code":"SSZJPZ","label":"IBM InfoSphere Information Server"},"ARM Category":[{"code":"a8m0z0000001i9oAAA","label":"Microservices Tier and Kubernetes Issues"}],"ARM Case Number":"","Platform":[{"code":"PF016","label":"Linux"}],"Version":"11.7.1"}]

Document Information

Modified date:
14 October 2022

UID

ibm16824799