Troubleshooting installation on OpenShift

Review the following known issues and troubleshooting tips if you encounter a problem while installing API Connect on OpenShift, including as a component of IBM Cloud Pak for Integration (CP4I).

Note: The Help page of the Cloud Manager, API Manager, and API Designer user interfaces includes a Product information tile that you can click to see your product versions, as well as Git information about the package versions in use. The API Designer product information is based on its associated management server, but the Git information is based on where API Designer was downloaded from.

V10.0.7.0: Failed `integration-ibm-cloud-native-postgresql` `CatalogSource` on ROKS 4.14 and OpenShift Container Platform 4.15

The API Connect operator creates the EDB catalog source in the same namespace as the API Connect operator.

The EDB catalog source reports a failed state and shows the following status message:

Status:
  Message:  couldn't ensure registry server - error ensuring pod: : error creating new pod: integration-ibm-cloud-native-postgresql-: pods "integration-ibm-cloud-native-postgresql-hnjbn" is forbidden: violates PodSecurity "restricted:v1.24": allowPrivilegeEscalation != false (container "registry-server" must set securityContext.allowPrivilegeEscalation=false), unrestricted capabilities (container "registry-server" must set securityContext.capabilities.drop=["ALL"]), runAsNonRoot != true (pod or container "registry-server" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container "registry-server" must set securityContext.seccompProfile.type to "RuntimeDefault" or "Localhost")
  Reason:   RegistryServerError

The problem occurs when the namespace is set to enforce the restricted pod security admission policy with the pod-security.kubernetes.io/enforce: restricted label.

ROKS 4.14 and some OpenShift Container Platform versions such as 4.15 have enforce set to restricted.

To work around this problem, update the integration-ibm-cloud-native-postgresql CatalogSource to use the restricted security context configuration:

oc patch CatalogSource integration-ibm-cloud-native-postgresql --type merge --patch '{"spec":{"grpcPodConfig":{"securityContextConfig":"restricted"}}}'
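
To confirm the diagnosis before you patch, and the result afterward, a sketch like the following might help. It assumes that `oc` is already logged in to the cluster. The function names are illustrative only, and `status.connectionState.lastObservedState` is the field where Operator Lifecycle Manager normally records the catalog source state; check the full resource status if that field is empty on your cluster.

```shell
# Print the pod security "enforce" level that is set on a namespace.
# The output is empty if the label is not set.
ns_enforce_level() {
  oc get namespace "$1" \
    -o jsonpath='{.metadata.labels.pod-security\.kubernetes\.io/enforce}'
}

# Print the last observed connection state of the EDB catalog source.
# After the patch is applied, the state is expected to become READY.
edb_catalog_state() {
  oc get catalogsource integration-ibm-cloud-native-postgresql -n "$1" \
    -o jsonpath='{.status.connectionState.lastObservedState}'
}

# Usage (replace <namespace> with the API Connect operator namespace):
#   ns_enforce_level <namespace>
#   edb_catalog_state <namespace>
```

If the first function prints `restricted` and the catalog source status shows the RegistryServerError reason, the symptoms match this known issue.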

One or more pods are in `CrashLoopBackOff` or `Error` state and report a certificate error in the logs

In rare cases, cert-manager might detect a certificate in a bad state right after it has been issued, and then reissue the certificate. If a CA certificate is issued twice, any certificate that was signed by the previously issued CA is left stale and cannot be validated by the newly issued CA. In this scenario, one of the following messages appears in the log:

javax.net.ssl.SSLHandshakeException: Received fatal alert: certificate_unknown

Error: unable to verify the first certificate

ERROR: openssl verify failed to verify the Portal CA tls.crt, ca.crt chain signed the Portal Server tls.crt cert

Resolve the problem by completing the following steps:

1. Use apicops (v10 version 0.10.57+ required) to validate the certificates in the system:
```
apicops upgrade:stale-certs -n <namespace>
```
2. If any certificate that is managed by cert-manager fails the validation, delete the stale certificate secret:
```
oc delete secret <stale-secret> -n <namespace>
```
Cert-manager automatically generates a new certificate to replace the one you deleted.
3. Use apicops to make sure all certificates can be verified successfully:
```
apicops upgrade:stale-certs -n <namespace>
```
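
Before you run the steps above, it can be useful to confirm which pods are affected and whether their logs contain one of the listed certificate errors. The following is a minimal sketch, assuming that `oc` is logged in to the cluster; the function names are hypothetical, and the search pattern covers only the three messages quoted in this topic.

```shell
# List the names of pods whose STATUS column shows CrashLoopBackOff or Error.
failing_pods() {
  oc get pods -n "$1" --no-headers |
    awk '$3 ~ /CrashLoopBackOff|Error/ {print $1}'
}

# Print the log lines of one pod that match the certificate errors
# quoted in this topic.
cert_errors() {
  oc logs -n "$1" "$2" --all-containers=true 2>/dev/null |
    grep -E 'certificate_unknown|unable to verify the first certificate|openssl verify failed'
}

# Usage:
#   failing_pods <namespace>
#   cert_errors <namespace> <pod-name>
```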

You see the `denied: insufficient scope` error during an air-gapped deployment

Problem: You encounter the denied: insufficient scope message while mirroring images during an air-gapped installation or upgrade.

Reason: This error occurs when there is a problem with the entitlement key that is used to obtain images.

Solution: Obtain a new entitlement key by completing the following steps:

1. Log in to the IBM Container Library.
2. In the Container software library, select Get entitlement key.
3. After the Access your container software heading, click Copy key.
4. Copy the key to a safe location.
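
After you obtain the new key, the registry login that your mirroring procedure uses must be refreshed with it. The following sketch shows one possible way to do so; it is an assumption, not part of the documented procedure. It assumes that images are pulled from `cp.icr.io` with the user name `cp`, that `podman` is the container tool in use, and that the key is saved in a local file. Verify each of these against your own mirroring instructions.

```shell
# Log in to the entitled registry with a key that is read from a file,
# so that the key does not appear in the shell history.
# The registry host (cp.icr.io) and user name (cp) are assumptions.
registry_login() {
  key_file="$1"
  podman login cp.icr.io --username cp --password-stdin < "$key_file"
}

# Usage:
#   registry_login ./entitlement-key.txt
```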

Apiconnect operator pod fails

Problem: During installation (or upgrade), the apiconnect operator fails with the following message:

panic: unable to build API support: unable to get Group and Resources: unable to retrieve the complete list of server APIs: packages.operators.coreos.com/v1: the server is currently unable to handle the request

goroutine 1 [running]:
github.ibm.com/velox/apiconnect-operator/operator-utils/v2/apiversions.GetAPISupport(0x0)
	operator-utils/v2/apiversions/api-versions.go:89 +0x1e5
main.main()
	ibm-apiconnect/cmd/manager/main.go:188 +0x4ee

Additional symptoms:

The apiconnect operator pod is in `CrashLoopBackOff` status.

Kube apiserver pods log the following information:

E1122 18:02:07.853093 18 available_controller.go:437] v1.packages.operators.coreos.com failed with:
 failing or missing response from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1:
 bad status from https://10.128.0.3:5443/apis/packages.operators.coreos.com/v1: 401

The IP address logged here belongs to the package server pod in the openshift-operator-lifecycle-manager namespace.

Package server pods log the following error messages, which show the /apis/packages.operators.coreos.com/v1 API call being rejected with a 401 status:

E1122 18:10:25.614179 1 authentication.go:53] Unable to authenticate the request due to an error: x509: certificate signed by unknown authority
I1122 18:10:25.614224 1 httplog.go:90] verb="GET" URI="/apis/packages.operators.coreos.com/v1" latency=161.243µs resp=401 UserAgent="Go-http-client/2.0" srcIP="10.128.0.1:41370":

The problem is intermittent.

Solution:

If you see the exact symptoms that are described, delete the package server pods in the openshift-operator-lifecycle-manager namespace.
The replacement package server pods log the 200 Success message for the same API call.
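
The deletion can be sketched as follows, assuming that `oc` is logged in with sufficient privileges. The function names are hypothetical, and matching pods by the `packageserver` name prefix is an assumption; list the pods in the namespace first to confirm their names. The API service name `v1.packages.operators.coreos.com` is taken from the kube apiserver log shown above.

```shell
OLM_NS=openshift-operator-lifecycle-manager

# List package server pods by name prefix (the prefix is an assumption).
packageserver_pods() {
  oc get pods -n "$OLM_NS" -o name | grep '^pod/packageserver'
}

# Delete each package server pod; the owning controller is expected
# to create replacement pods.
restart_packageserver() {
  for pod in $(packageserver_pods); do
    oc delete -n "$OLM_NS" "$pod"
  done
}

# Print the status of the Available condition of the packages API service.
apiservice_available() {
  oc get apiservice v1.packages.operators.coreos.com \
    -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
}

# Usage:
#   restart_packageserver
#   apiservice_available
```

When the replacement pods are running, `apiservice_available` should print `True`, and the apiconnect operator pod can then be checked again.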

Disabling the Portal web endpoint check

When you create or register a Developer Portal service, the Portal subsystem checks that the Portal web endpoint is accessible. However, the endpoint sometimes cannot be reached, for example because of the complexity of public and private networks. The following example shows the errors that you might see in the admin container logs of the portal-www pod if the endpoint cannot be reached:

An error occurred contacting the provided portal web endpoint: example.com
The provided Portal web endpoint example.com returned HTTP status code 504

In this instance, you can disable the Portal web endpoint check so that the Developer Portal service can be created successfully.

To disable the endpoint check, complete the following update:

On Kubernetes, OpenShift, and IBM® Cloud Pak for Integration

Add the following section to the Portal custom resource (CR) template:

spec:
  template:
  - containers:
    - env:
      - name: PORTAL_SKIP_WEB_ENDPOINT_VALIDATION
        value: "true"
      name: admin
    name: www
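
After the Portal CR is updated and the www pod restarts, you might want to confirm that the variable reached the admin container. The following is a minimal sketch, assuming that `oc` is logged in to the cluster; the function name is hypothetical, and you must supply the actual name of the www pod in your deployment.

```shell
# Print the value of the skip flag inside the admin container of a www pod.
# The expected output after the update is: true
portal_skip_flag() {
  oc exec -n "$1" "$2" -c admin -- printenv PORTAL_SKIP_WEB_ENDPOINT_VALIDATION
}

# Usage:
#   portal_skip_flag <namespace> <portal-www-pod-name>
```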