DataPower Operator Upgrades
If any known issues exist or migration paths are needed to upgrade from one version of the DataPower Operator to another, they will be documented here.
1.2.0
Pod Topology Spread Constraints
Symptoms
DataPower Operator pods fail to schedule, stating that no nodes match pod topology spread constraints (missing required label).
0/15 nodes are available: 12 node(s) didn't match pod topology spread constraints (missing required label), 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
Description
There was a Kubernetes bug that allowed for pods to schedule when topologyKey
was not matched. Kubernetes recently fixed this bug in 1.20, and back-ported the fix to
1.18 and 1.19.
Once this fix is installed to a Kubernetes cluster, the scheduler would no longer schedule DataPower Operator pods, due to the topologyKey
in our pod spec not using a well-known "zone" label.
Solution
We have fixed this issue in DataPower Operator version 1.2.1
, see Release notes.
Performing a clean installation of DataPower Operator version 1.2.1
or higher should succeed without the below workaround needed. However, if you previously tried to upgrade to 1.2.0
and the upgrade is stuck in a pending
state, the following workaround can be used to allow the 1.2.0
install to complete.
Manually add a zone
label (with any value) to one of the worker nodes in the Kubernetes cluster. The DataPower Operator pods will then be scheduled to that worker node.
kubectl label nodes <node-name> zone=<label-value>
Reference: Assigning Pods to Nodes
Optionally, once a successful upgrade to 1.2.1
has been completed, the zone
label can be removed from the node(s) in the cluster. This zone
label is no longer used in 1.2.1
or higher.
Slow Kubernetes garbage collection can cause webhook service creation to fail
Symptoms
An error may be seen that indicates a webhook Service
is not found:
failed calling webhook "datapowerservices.defaulter.datapower.ibm.com": Post https://datapower-operator.default.svc:443/default-datapower-ibm-com-v1beta2-datapowerservice?timeout=2s: service "datapower-operator" not found
In the datapower-operator
pod logs, the following log message is seen indicating the Service
exists already:
{"level":"info","ts":"2020-10-06T20:32:45.818Z","logger":"setup-webhook","msg":"webhook service found. skip create"}
However, checking for the Service
object (name datapower-operator
) in the namespace would show it does not exist:
$ kubectl get svc datapower-operator
Description
During the DataPower Operator boot the defaulting and validating webhooks are initialized. As part of this process, a Service
object is created if not found in the cluster. If the Kubernetes garbage collection is slow or delayed it
is possible for the Service
object to have been marked for deletion, but still exist in the namespace for a span of time. If the new DataPower Operator pod attempts to initialize the new Service
resource during this time
window, the operator will continue through the boot sequence without creating a new Service
instance.
Once the Kubernetes garbage collection catches up, and the Service
is deleted, no Service
will remain to expose the webhooks and thus API errors will be seen when invoking the webhooks.
Solution
To resolve this issue, the DataPower Operator pod can be deleted manually. Once this is done, the Deployment's ReplicaSet controller will spin up a new DataPower Operator pod, which will in turn create the webhook Service
.
-
Fetch the DataPower Operator pod, taking note of the name (will be the first column in the output).
kubectl [-n namespace] get pod | grep datapower-operator
-
Delete the DataPower Operator pod, where
name
is the name of the pod found in Step 5.kubectl [-n namespace] delete pod/name
-
Validate a new pod is created in its place.
kubectl [-n namespace] get pod | grep datapower-operator
1.1.0
Operator lock not released after Leader Pod removed
Symptoms
- Immediately following an upgrade from 1.0.X to 1.1.X, you may see the following error when attempting to interact with a DataPowerService CR:
Error from server: conversion webhook for datapower.ibm.com/v1beta1, Kind=DataPowerService failed: Post https://changeme.default.svc:443/?timeout=30s: service "changeme" not found
- Failure to reconcile changes made to a DataPowerService CR after a failover event caused the previous DataPower Operator Leader Pod to be rescheduled.
Description
Occasionally, when the DataPower Operator Leader Pod is removed, the new DataPower Operator Pod will show as being Ready but isn't doing anything. In the logs you might see
{"level":"info","ts":"2020-09-08T19:29:53.432Z","logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":"2020-09-08T19:29:57.971Z","logger":"leader","msg":"Leader pod has been deleted, waiting for garbage collection do remove the lock."}
followed by additional waiting. In this case, Kubernetes' garbage collection failed to clean up the datapower-operator-lock
after the old Leader Pod was removed. This stops the new DataPower Operator Pod from completing initialization
tasks such as creating the conversion webhook.
Solution
To resolve this issue, you must manually remove the lock resource. This can be done with
kubectl delete cm datapower-operator-lock
Once the lock is removed, the new DataPower Operator pod will begin its initialization.
1.0.1
Invalid value for spec.selector
When attempting to ugprade from 1.0.0
to 1.0.1
through the Operator Lifecycle Manager, an error will likely be seen that the installPlan
failed.
install strategy failed: Deployment.apps "datapower-operator" is invalid: spec.selector: Invalid value: v1.LabelSelector{...}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: field is immutable
This error occurs because between version 1.0.0
and 1.0.1
a new label
was added to the DataPower Operator Deployment resource to fix an issue related to operator-metrics. However, the install plan is not able
to resolve this update because the LabelSelector is an immutable field.
Resolution
To workaround this issue, you can manually delete the existing datapower-operator
Deployment resource. The Operator Lifecycle Manager should then recreate the datapower-operator
Deployment resource with the 1.0.1
spec, and the install plan should succeed.
oc delete deployment datapower-operator
Once done, validate a new datapower-operator
Deployment is created:
oc get deployment
Then validate the 1.0.1
install plan succeeds by checking the ClusterServiceVersion
resource:
$ oc get csv
NAME DISPLAY VERSION REPLACES PHASE
datapower-operator.v1.0.1 IBM DataPower Gateway 1.0.1 datapower-operator.v1.0.0 Succeeded