DataPower Operator Upgrades

If any known issues exist or migration paths are needed to upgrade from one version of the DataPower Operator to another, they will be documented here.

1.2.0

Pod Topology Spread Constraints

Symptoms

DataPower Operator pods fail to schedule, stating that no nodes match pod topology spread constraints (missing required label).

0/15 nodes are available: 12 node(s) didn't match pod topology spread constraints (missing required label), 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.

Description

A Kubernetes bug allowed pods to be scheduled even when the topologyKey was not matched. Kubernetes fixed this bug in 1.20 and back-ported the fix to 1.18 and 1.19.

Once this fix is applied to a Kubernetes cluster, the scheduler no longer schedules DataPower Operator pods, because the topologyKey in our pod spec does not use a well-known "zone" label.

Solution

We have fixed this issue in DataPower Operator version 1.2.1; see the Release notes.

A clean installation of DataPower Operator version 1.2.1 or higher should succeed without the workaround below. However, if you previously tried to upgrade to 1.2.0 and the upgrade is stuck in a pending state, the following workaround allows the 1.2.0 install to complete.

Manually add a zone label (with any value) to one of the worker nodes in the Kubernetes cluster. The DataPower Operator pods will then be scheduled to that worker node.

kubectl label nodes <node-name> zone=<label-value>

Reference: Assigning Pods to Nodes
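
As a minimal sketch of the labeling step, the helpers below list nodes that do not yet have a zone label and then label one of them. The function names are illustrative and not part of the DataPower Operator. The `!zone` selector and the `-o name` output format are standard kubectl features.

```shell
# List nodes that do not yet carry a "zone" label.
# '!zone' is a label selector that matches objects without that key.
nodes_without_zone() {
  kubectl get nodes -l '!zone' -o name
}

# Label a single node; the value is arbitrary for this workaround.
# Usage: add_zone_label <node-name> <label-value>
add_zone_label() {
  kubectl label nodes "$1" "zone=$2"
}
```

Run `nodes_without_zone` first, pick a worker node from the output (kubectl prints each as `node/<node-name>`), and pass the bare node name to `add_zone_label`.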

Optionally, once a successful upgrade to 1.2.1 has been completed, the zone label can be removed from the node(s) in the cluster. This zone label is no longer used in 1.2.1 or higher.
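
A label is removed with kubectl by appending `-` to the label key. The helper below is a small sketch of that cleanup; the function name is illustrative. It assumes the zone label was added only for this workaround, so do not remove it if your cluster uses a zone label for other purposes.

```shell
# Remove the "zone" label from a node.
# A trailing "-" after the label key tells kubectl to delete that label.
# Usage: remove_zone_label <node-name>
remove_zone_label() {
  kubectl label nodes "$1" zone-
}
```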

Slow Kubernetes garbage collection can cause webhook service creation to fail

Symptoms

An error may be seen that indicates a webhook Service is not found:

failed calling webhook "datapowerservices.defaulter.datapower.ibm.com": Post https://datapower-operator.default.svc:443/default-datapower-ibm-com-v1beta2-datapowerservice?timeout=2s: service "datapower-operator" not found

In the datapower-operator pod logs, the following log message is seen indicating the Service exists already:

{"level":"info","ts":"2020-10-06T20:32:45.818Z","logger":"setup-webhook","msg":"webhook service found. skip create"}

However, checking for the Service object (named datapower-operator) in the namespace shows that it does not exist:

$ kubectl get svc datapower-operator

Description

During DataPower Operator boot, the defaulting and validating webhooks are initialized. As part of this process, a Service object is created if one is not found in the cluster. If Kubernetes garbage collection is slow or delayed, the old Service object can be marked for deletion but still exist in the namespace for a span of time. If the new DataPower Operator pod initializes its webhooks during this window, it finds the old Service, skips creation, and continues through the boot sequence without creating a new Service instance.

Once the Kubernetes garbage collection catches up, and the Service is deleted, no Service will remain to expose the webhooks and thus API errors will be seen when invoking the webhooks.
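
One way to check whether the Service is currently in this window is to read its `metadata.deletionTimestamp`, which Kubernetes sets on objects that are marked for deletion. This is a sketch only: the function name is illustrative, and the namespace defaults to `default` as an assumption taken from the error message above.

```shell
# Print the deletion timestamp of the webhook Service.
# Empty output means the Service exists and is not marked for deletion.
# Usage: webhook_svc_deletion_ts [namespace]
webhook_svc_deletion_ts() {
  kubectl -n "${1:-default}" get svc datapower-operator \
    -o jsonpath='{.metadata.deletionTimestamp}'
}
```

If the Service has already been removed, kubectl reports that it was not found, which matches the symptom described above.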

Solution

To resolve this issue, the DataPower Operator pod can be deleted manually. Once this is done, the Deployment's ReplicaSet controller will spin up a new DataPower Operator pod, which will in turn create the webhook Service.

Fetch the DataPower Operator pod, taking note of its name (the first column in the output).
```
 kubectl [-n namespace] get pod | grep datapower-operator
```
Delete the DataPower Operator pod, where name is the name of the pod found in the previous step.
```
 kubectl [-n namespace] delete pod/name
```

Validate a new pod is created in its place.

 kubectl [-n namespace] get pod | grep datapower-operator
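
As an optional follow-up, the sketch below waits for the replacement pod to become ready and then confirms that the webhook Service exists. It assumes the Deployment is named datapower-operator, like the Service, and defaults to the `default` namespace. The function name and the 120 second timeout are illustrative choices.

```shell
# Wait for the operator Deployment to finish rolling out, then confirm
# that the webhook Service has been created.
# Usage: verify_operator_recovery [namespace]
verify_operator_recovery() {
  kubectl -n "${1:-default}" rollout status deployment/datapower-operator --timeout=120s &&
    kubectl -n "${1:-default}" get svc datapower-operator
}
```

The second command runs only if the rollout completes in time, so a non-zero exit status means either the pod did not become ready or the Service is still missing.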

1.1.0

Operator lock not released after Leader Pod removed

Symptoms

Immediately following an upgrade from 1.0.X to 1.1.X, you may see the following error when attempting to interact with a DataPowerService CR:

Error from server: conversion webhook for datapower.ibm.com/v1beta1, Kind=DataPowerService failed: Post https://changeme.default.svc:443/?timeout=30s: service "changeme" not found

You may also see a failure to reconcile changes made to a DataPowerService CR after a failover event causes the previous DataPower Operator Leader Pod to be rescheduled.

Description

Occasionally, when the DataPower Operator Leader Pod is removed, the new DataPower Operator Pod shows as Ready but does not do any work. In its logs you might see:

{"level":"info","ts":"2020-09-08T19:29:53.432Z","logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":"2020-09-08T19:29:57.971Z","logger":"leader","msg":"Leader pod has been deleted, waiting for garbage collection do remove the lock."}

followed by additional waiting. In this case, Kubernetes' garbage collection failed to clean up the datapower-operator-lock after the old Leader Pod was removed. This stops the new DataPower Operator Pod from completing initialization tasks such as creating the conversion webhook.

Solution

To resolve this issue, you must manually remove the lock resource. This can be done with:

kubectl delete cm datapower-operator-lock

Once the lock is removed, the new DataPower Operator pod will begin its initialization.
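
Before deleting the lock, you may want to confirm that it is stale. The sketch below assumes the lock ConfigMap records the leader pod as its first owner reference, which is how the log messages above suggest the lock is tied to the Leader Pod; verify this against your cluster. The function name is illustrative and the namespace defaults to `default`.

```shell
# Print the name of the pod recorded as owner of the lock ConfigMap.
# Assumption: the leader pod is the first entry in ownerReferences.
# Usage: lock_owner [namespace]
lock_owner() {
  kubectl -n "${1:-default}" get cm datapower-operator-lock \
    -o jsonpath='{.metadata.ownerReferences[0].name}'
}
```

If the printed pod name no longer appears in `kubectl get pod`, the lock belongs to a pod that has been removed and can be deleted as described above.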

1.0.1

Invalid value for `spec.selector`

When attempting to upgrade from 1.0.0 to 1.0.1 through the Operator Lifecycle Manager, you will likely see an error stating that the installPlan failed.

install strategy failed: Deployment.apps "datapower-operator" is invalid: spec.selector: Invalid value: v1.LabelSelector{...}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable

This error occurs because between version 1.0.0 and 1.0.1 a new label was added to the DataPower Operator Deployment resource to fix an issue related to operator-metrics. However, the install plan is not able to resolve this update because the LabelSelector is an immutable field.

Resolution

To work around this issue, you can manually delete the existing datapower-operator Deployment resource. The Operator Lifecycle Manager should then recreate the datapower-operator Deployment resource with the 1.0.1 spec, and the install plan should succeed.

oc delete deployment datapower-operator

Once done, validate a new datapower-operator Deployment is created:

oc get deployment

Then validate the 1.0.1 install plan succeeds by checking the ClusterServiceVersion resource:

$ oc get csv
NAME                        DISPLAY                 VERSION   REPLACES                    PHASE
datapower-operator.v1.0.1   IBM DataPower Gateway   1.0.1     datapower-operator.v1.0.0   Succeeded
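
For scripted checks, the phase can be read directly from the ClusterServiceVersion instead of scanning the table. This is a minimal sketch; the function name is illustrative, and it reads the `status.phase` field that backs the PHASE column shown above.

```shell
# Print only the phase of a ClusterServiceVersion, for example "Succeeded".
# Usage: csv_phase <csv-name>
csv_phase() {
  oc get csv "$1" -o jsonpath='{.status.phase}'
}
```

For example, `csv_phase datapower-operator.v1.0.1` should print `Succeeded` once the install plan has completed.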
